PopSeqle: Software for Population Sequence data to Lower Errors
Lead Research Organisation:
University of East Anglia
Department Name: Environmental Sciences
Abstract
Currently, quality control (QC) checks of NGS data typically rely on the use of a single reference genome. This makes it very difficult to identify errors in sequence and assembly and to distinguish such artefacts from genuine biological phenomena such as genome rearrangements, copy number variants and aneuploidy. To date, no software has been developed for QC of whole genome population sequence data that incorporates population genetics theory to identify errors. The principal aim of this proposal is to develop a fast, user-friendly software platform (PopSeqle) to perform QC checks of population-based sequence data. PopSeqle uses a population genetic framework to identify potential irregularities in NGS genome / transcriptome assemblies by identifying regions with outlying values for their summary statistics (e.g. pi, FST, Ho). Such regions either represent sequence or assembly errors, or they may represent parts of the genome or transcriptome that are of genuine evolutionary genetic interest. The proposed project is timely as NGS technologies have only recently gained the capability to generate affordable genome level sequence data from many individuals. PopSeqle will be developed in the new programming language 'Julia', and it will identify errors by performing a relatively novel 'wavelet transform analysis' to locate peaks and valleys in a signal of population genetic summary statistics across the sequence space. Scientists working on genome datasets of plants and plant pathogens will test the beta version of the new software. These researchers will provide feedback which will help us optimise the software algorithms, thereby ensuring stakeholder relevance. Finally, the software, handbook and training video will be uploaded to the TGAC website, and workshops will be organised to demonstrate the PopSeqle software to end-users, thereby promoting staff training potential and increasing value for money.
This project will facilitate new interactions between research staff, postdocs, and PhD students on the NRP and elsewhere who work with NGS data of crops and crop pathogens, thereby enhancing research that is relevant to the BBSRC strategy.
Technical Summary
PopSeqle is a fast, user-friendly software tool to perform quality control (QC) checks of population-based sequence data. Uniquely, the software locates irregular sequence regions in multiple aligned genome assemblies by identifying regions with deviating values for their summary statistics (pi, Fst, Ho, Ts/Tv, dN/dS, Ka/Ks, Patterson's D, fd). Assembly artefacts and sequencing errors are identified by performing a wavelet transform analysis, which locates peaks and valleys in the population genetic signal across the sequence space. The wavelet transform is a relatively new mathematical method similar to the Fourier transform, and it has been used successfully with NGS data in the past. By using population genetic theory, PopSeqle is able to quantify the signal present in multiple sequence alignments caused by evolutionary forces and separate this from the signal caused by errors. Fundamentally, evolutionary forces act on all individuals in the population (and often over considerable timescales). In contrast, errors in the sequence data do not comply with population genetic rules, and this enables us to discriminate between the population genetic signatures of evolutionary forces and errors. In addition, PopSeqle will employ a sliding-window approach to help visualise and identify outlying regions. Following this initial check, the software will then direct the user to appropriate further QC checks and/or downstream analyses to investigate the regions of interest. The software algorithms will be evaluated, customised and improved using both simulated and empirical datasets. The ultimate aim of the PopSeqle project is to help improve the quality of NGS data and analyses to the benefit of a wide research community.
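The sliding-window idea described above can be illustrated with a minimal Python sketch. This is not PopSeqle's code: the function names, the window parameters, and the median/MAD outlier rule are all assumptions chosen for illustration, and the wavelet transform step is not shown. The sketch computes nucleotide diversity (pi) per window across a set of aligned sequences and flags windows whose value deviates strongly from the rest.

```python
from itertools import combinations
from statistics import median

VALID = set("ACGT")

def pairwise_diffs(a, b):
    """Count sites where two aligned sequences differ, skipping gaps and Ns."""
    return sum(1 for x, y in zip(a, b) if x in VALID and y in VALID and x != y)

def window_pi(seqs, start, size):
    """Mean pairwise differences per site in one window.

    Simplification: divides by the window size, so gapped sites are not
    removed from the denominator.
    """
    chunk = [s[start:start + size] for s in seqs]
    pairs = list(combinations(chunk, 2))
    total = sum(pairwise_diffs(a, b) for a, b in pairs)
    return total / (len(pairs) * size)

def sliding_pi(seqs, size, step):
    """Return (window start, pi) for every full window along the alignment."""
    length = len(seqs[0])
    return [(start, window_pi(seqs, start, size))
            for start in range(0, length - size + 1, step)]

def flag_outliers(values, threshold=3.5):
    """Flag values with a large robust z-score (median / MAD based).

    The threshold of 3.5 is an illustrative assumption. If the MAD is zero,
    fall back to flagging any value that differs from the median.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [v != med for v in values]
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]
```

Flagged windows are only candidates: as the summary notes, an outlying region may be an error or may be of genuine evolutionary interest, so further checks are needed to tell the two apart.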
Planned Impact
The proposed software PopSeqle has great potential to improve the quality of NGS data in the analysis of large datasets consisting of multiple individuals of one or more populations. The rapidly increasing number of de novo genome sequence assemblies enables us to incorporate population genetic theory in the QC checks. The PopSeqle software will set a new benchmark in QC of whole genome population sequence data by incorporating information across multiple de novo genome assemblies. The new software will fit into the existing bioinformatics pipeline between the currently implemented QC tests based on read depth and other assembly- and sequence-quality statistics, and the downstream genetic analysis. In other words, the proposed project allows us to perform a novel QC step to check the quality of population-based whole genome sequence data. To accomplish this, we propose to incorporate population genetic theory and a wavelet transform analysis. We believe this is a very powerful approach to identify errors in genome assemblies that have hitherto gone unnoticed, such as artefacts due to mis-assembly (which may resemble genuine genome rearrangements). Similarly, highly diverged alleles (in populations with large effective population size, or of genes under balancing selection) may end up being mapped on different scaffolds, and vice versa, copy number variants may be erroneously collapsed. Using the information of multiple de novo assemblies and analysing this in a population genetic framework will significantly improve the quality of NGS data. Over 10 years ago, the PI implemented population genetic algorithms to check genotyping accuracy of SSR loci (microsatellite loci) and developed the software Micro-Checker. That software set a new benchmark in QC of these genetic markers and received more than 5400 citations and 18000 downloads.
As with the launch of Micro-Checker, we believe that the proposed PopSeqle software will fundamentally change the way in which QC of NGS data will be performed in the future.
Publications
Mathers TC
(2019)
Sex-specific changes in the aphid DNA methylation landscape.
in Molecular ecology
Lighten J
(2017)
Evolutionary genetics of immunological supertypes reveals two faces of the Red Queen.
in Nature communications
Nader J
(2019)
Evolutionary genomics of anthroponosis in Cryptosporidium
in Nature Microbiology
Thilliez GJA
(2019)
Pathogen enrichment sequencing (PenSeq) enables population genomic studies in oomycetes.
in The New phytologist
Jouet A
(2019)
Albugo candida race diversity, ploidy and host-associated microbes revealed using DNA sequence capture on diseased plants in the field.
in The New phytologist
Description | We are developing an analysis pipeline that will facilitate the population genetic and evolutionary analyses of whole genome sequence data. We are solving challenges that come with such "big data" analyses using a novel programming language (Julia) that has several advantages over existing languages. Our software has been used by different groups, including two groups working on different plant pathogens (oomycetes), and one group working on human parasites (Cryptosporidium). We are now happy to report that this has resulted in 2 publications: Jouet, A., Saunders, D. G., McMullan, M., Ward, B., Furzer, O., Jupe, F., ... & van Oosterhout, C., Jones, J. D. G. (2018). Albugo candida race diversity, ploidy and host-associated microbes revealed using DNA sequence capture on diseased plants in the field. New Phytologist. Thilliez, G.J., Armstrong, M.R., Lim, T.Y., Baker, K., Jouet, A., Ward, B., Van Oosterhout, C., Jones, J.D., Huitema, E., Birch, P.R. and Hein, I., 2018. Pathogen enrichment sequencing (PenSeq) enables population genomic studies in oomycetes. New Phytologist. Both papers have now been published. |
Exploitation Route | The software and IT infrastructure we are developing are already used by others. The analysis pipeline facilitates the population genetic and evolutionary analyses of whole genome sequence data and is now being adopted by other projects. We have set up another GitHub account for this software: https://vanoosterhoutlab.github.io/HybridCheck/ |
Sectors | Agriculture Food and Drink Digital/Communication/Information Technologies (including Software) Education Environment |
URL | https://github.com/Ward9250 |
Description | The PI of this grant (Cock Van Oosterhout) gave a popular science lecture to the general public at the Norwich Science Festival in 2019, explaining how "Next Generation Sequence" data is being used to address contemporary questions in biology. The PI has also given several presentations and seminars over Zoom in 2020. These presentations featured the software and code developed during the BBSRC grant, including a presentation for the European Association of Zoos and Aquaria (EAZA) in December 2020 about genetic conservation during captive breeding. I have since given various presentations on data analyses and algorithms that build on this BBSRC grant. |
First Year Of Impact | 2019 |
Sector | Education,Environment |
Impact Types | Cultural Societal Policy & public services |
Title | Analysis of recombination and genetic introgression in whole genome sequence data |
Description | I was contacted by a group working in Brazil on the evolutionary genomic analysis of SARS-CoV-2. They thought they had detected a novel recombinant variant of SARS-CoV-2 (a Deltacron). I used some of the technology we developed in PopSeqle to identify and confirm the genetically introgressed region. |
Type Of Material | Improvements to research infrastructure |
Year Produced | 2023 |
Provided To Others? | Yes |
Impact | It is now much easier to identify hybridisation between different SARS-CoV-2 variants using the methods that we described in our paper. |
URL | https://doi.org/10.3390/vaccines11020212 |
Title | SpeedDate: Software to calculate divergence times of sequence variation within and between genomes |
Description | Estimating the divergence times of alleles or haplotypes (within a single genome), and between loci (in pairs of sequences) can reveal the level of admixture, inbreeding and outbreeding in the population, and it may help to infer the effects of selection on particular loci and/or alleles. Bayesian coalescent-based approaches, however, may be too computationally intensive to process whole genome sequence data. Here we introduce the software SpeedDate to calculate divergence times of DNA (and RNA) sequence data within and across genomes. The software uses a fast algorithm and a sliding window approach, and it calculates divergence times using a JC, K80, F81, HKY, or GTR correction. It produces intuitive graphs to illustrate the variation in divergence times within and across genomes, and it identifies outlier regions that are significantly more conserved or diverged. SpeedDate is a command line application written in the Julia programming language, and it has an optional graphical user interface (GUI). It can analyse aligned sequences and whole genome sequence data in FASTA format of a single (diploid) individual, as well as single or multiple populations of individuals (haploid, diploid or polyploid). SpeedDate is libre software, and both the software and its manual are free to download from http://ward9250.github.io/SpeedDate/ |
Type Of Material | Computer model/algorithm |
Year Produced | 2017 |
Provided To Others? | Yes |
Impact | Analyses using the software are now being conducted by Agathe Jouet (The Sainsbury Lab, Norwich Research Park) on Albugo candida. SpeedDate is also being used to analyse Cryptosporidium evolution in Kevin Tyler's group (UEA). |
URL | http://ward9250.github.io/SpeedDate/ |
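To make the distance corrections named in the SpeedDate description concrete, the following Python sketch computes the standard JC69 and K80 corrected distances for one pair of aligned sequences and converts a distance into a divergence time. It is an illustration only, not SpeedDate's implementation (which is written in Julia): the function names are hypothetical, and the conversion T = d / (2 * rate) assumes a constant substitution rate per site per unit time that the user must supply.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def count_changes(a, b):
    """Return (sites compared, transitions, transversions) for two aligned sequences.

    Sites where either sequence has a gap or an ambiguous base are skipped.
    """
    n = ts = tv = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue
        n += 1
        if x == y:
            continue
        pair = {x, y}
        if pair <= PURINES or pair <= PYRIMIDINES:
            ts += 1  # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            tv += 1  # purine <-> pyrimidine
    if n == 0:
        raise ValueError("no comparable sites")
    return n, ts, tv

def jc69(a, b):
    """Jukes-Cantor distance: d = -3/4 * ln(1 - 4p/3), p = proportion of differing sites."""
    n, ts, tv = count_changes(a, b)
    p = (ts + tv) / n
    # math.log raises ValueError when the sequences are saturated (p >= 0.75).
    return -0.75 * math.log(1 - 4 * p / 3)

def k80(a, b):
    """Kimura two-parameter distance: d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q))."""
    n, ts, tv = count_changes(a, b)
    P, Q = ts / n, tv / n
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))

def divergence_time(distance, rate):
    """Time since divergence, assuming `rate` substitutions per site per unit time
    on each of the two lineages (hence the factor 2)."""
    return distance / (2 * rate)
```

Applying these functions to successive windows of an alignment, rather than to whole sequences, would give the kind of per-region divergence profile the description refers to.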
Description | Collaboration on a study of the rice blast fungus with The Sainsbury Laboratory and Kobe University (Japan) |
Organisation | Kobe University |
Country | Japan |
Sector | Academic/University |
PI Contribution | We are conducting computational modelling in this collaboration on a study of the rice blast fungus with The Sainsbury Laboratory. |
Collaborator Contribution | Partners provided whole genome sequence data of 9 pathogen isolates. |
Impact | Project is still in progress. Multidisciplinary, involving plant-pathogen specialists, genomics scientists, evolutionary geneticists, and bioinformaticians. |
Start Year | 2017 |
Description | Collaboration on a study of the rice blast fungus with The Sainsbury Laboratory and Kobe University (Japan) |
Organisation | University of Cambridge |
Department | The Sainsbury Laboratory |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | We are conducting computational modelling in this collaboration on a study of the rice blast fungus with The Sainsbury Laboratory. |
Collaborator Contribution | Partners provided whole genome sequence data of 9 pathogen isolates. |
Impact | Project is still in progress. Multidisciplinary, involving plant-pathogen specialists, genomics scientists, evolutionary geneticists, and bioinformaticians. |
Start Year | 2017 |
Description | Evolutionary genetic analysis of C. hominis and C. parvum |
Organisation | University of East Anglia |
Department | School of Medicine UEA |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | We are performing population genetic, evolutionary genetic and phylogenetic analyses to understand the evolution of host adaptation in Cryptosporidium. The software SpeedDate is used to estimate divergence times between isolates. |
Collaborator Contribution | 1) Partner has given us access to a large database of C. hominis and C. parvum whole genome sequences that enables us to develop the software. 2) Students in Tyler's group have helped us to improve the graphical user interface (GUI) of our software. |
Impact | MS is in preparation |
Start Year | 2017 |
Description | Genome analysis of plant pathogen Albugo candida in collaboration with The Sainsbury Laboratory, NRP |
Organisation | University of Cambridge |
Department | The Sainsbury Laboratory |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | Performed a population genomic analysis of MYbaits targeted genomes of the plant pathogen Albugo candida. |
Collaborator Contribution | Generated sequence data using MYbaits targeted approach to reveal effectors in the genomes of the plant pathogen Albugo candida. |
Impact | We have a paper submitted to New Phytologist describing our findings from the population genomic analysis of MYbaits targeted genomes of the plant pathogen Albugo candida. This is a multi-disciplinary study involving bioinformaticians, genomic scientists, evolutionary biologists and ecologists. |
Start Year | 2017 |
Title | SpeedDate is a simple tool for the estimation of coalescence times between sequences created with julia and Bio.jl |
Description | SpeedDate is a simple tool for the estimation of coalescence times between sequences, created with Julia and Bio.jl. SpeedDate is designed to take an input file of aligned sequences in FASTA format. For each pair of sequences, it will count the number of mutations between the two sequences, and then compute a coalescence time estimate interval. In order to do this, it must read in sequence files, process DNA sequences and get the number of mutations or genetic distance, compute the coalescence time, and then save the output. SpeedDate wraps this process up in a command line application and an optional graphical user interface (GUI). The software and help files are available at https://github.com/Ward9250/SpeedDate.jl |
Type Of Technology | Software |
Year Produced | 2016 |
Open Source License? | Yes |
Impact | The software is currently being trialled by other research groups at the Norwich Research Park before we write the program note. |
URL | https://github.com/Ward9250/SpeedDate.jl |
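The workflow in the description above (read aligned FASTA, count mutations for each pair, convert to a time) can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the SpeedDate.jl code: the function names are hypothetical, the time is an uncorrected point estimate t = k / (2 * rate * L) with a user-supplied per-site mutation rate, and the interval that SpeedDate reports is not reproduced here.

```python
from itertools import combinations

def parse_fasta(text):
    """Parse FASTA-formatted text into {name: sequence}; the name is the first
    word of each header line, and multi-line sequences are joined."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else f"seq{len(records) + 1}"
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {k: "".join(v) for k, v in records.items()}

def pairwise_mutations(records):
    """Count differing sites for every pair of aligned sequences.

    Gaps and ambiguous bases are ignored; sequences must have equal length.
    """
    counts = {}
    for (n1, s1), (n2, s2) in combinations(records.items(), 2):
        if len(s1) != len(s2):
            raise ValueError(f"{n1} and {n2} are not aligned to the same length")
        counts[(n1, n2)] = sum(
            1 for x, y in zip(s1, s2)
            if x in "ACGT" and y in "ACGT" and x != y
        )
    return counts

def coalescence_time(mutations, length, rate):
    """Point estimate of coalescence time: k mutations over `length` sites,
    accumulating at `rate` per site per unit time on both lineages."""
    return mutations / (2 * rate * length)
```

A caller would typically read the FASTA file into a string, pass it to `parse_fasta`, and then apply `coalescence_time` to each count returned by `pairwise_mutations`.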
Description | Lecture on plant pathogens |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Postgraduate students |
Results and Impact | This forms part of a lecture series for the MSc course and BSc course in Evolutionary Biology & Conservation Genetics organised and presented by CVO at the UEA. Students showed interest in this area of research. |
Year(s) Of Engagement Activity | 2018 |
Description | Population Genetics Group (PGG) meeting Cambridge 2017 |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | Presented a poster to illustrate and explain the use of the PopSeqle and SpeedDate software, which resulted in discussions on bioinformatics with the audience. |
Year(s) Of Engagement Activity | 2017 |
Description | Seminar at the UEA |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Undergraduate students |
Results and Impact | Presented bioinformatics tools for analysing big data, in particular sequence databases, using novel software and approaches developed during this project. |
Year(s) Of Engagement Activity | 2016,2017 |