PopSeqle: Software for Population Sequence data to Lower Errors

Lead Research Organisation: University of East Anglia
Department Name: Environmental Sciences

Abstract

Currently, quality control (QC) checks of NGS data typically rely on a single reference genome. This makes it very difficult to identify errors in sequence and assembly, and to distinguish such artefacts from genuine biological phenomena such as genome rearrangements, copy number variants and aneuploidy. To date, no software has been developed for QC of whole genome population sequence data that incorporates population genetics theory to identify errors. The principal aim of this proposal is to develop a fast, user-friendly software platform (PopSeqle) to perform QC checks of population-based sequence data. PopSeqle uses a population genetic framework to identify potential irregularities in NGS genome / transcriptome assemblies by identifying regions with outlying values for their summary statistics (e.g. pi, FST, Ho). Such regions either represent sequence or assembly errors, or they may represent parts of the genome or transcriptome that are of genuine evolutionary genetic interest. The proposed project is timely as NGS technologies have only recently gained the capability to generate affordable genome-level sequence data from many individuals. PopSeqle will be developed in the new programming language 'Julia', and it will identify errors by performing a relatively novel wavelet transform analysis to locate peaks and valleys in a signal of population genetic summary statistics across the sequence space. Scientists working on genome datasets of plants and plant pathogens will test the beta version of the new software. These researchers will provide feedback which will help us optimise the software algorithms, thereby ensuring stakeholder relevance. Finally, the software, handbook and training video will be uploaded to the TGAC website, and workshops will be organised to demonstrate the PopSeqle software to end-users, thereby promoting staff training potential and increasing value for money.
This project will facilitate new interactions between research staff, postdocs, and PhD students on the NRP and elsewhere who work with NGS data of crops and crop pathogens, thereby enhancing research that is relevant to the BBSRC strategy.

Technical Summary

PopSeqle is a fast, user-friendly software tool to perform quality control (QC) checks of population-based sequence data. Uniquely, the software locates irregular sequence regions in multiple aligned genome assemblies by identifying regions with deviating values for their summary statistics (pi, Fst, Ho, Ts/Tv, dN/dS, Ka/Ks, Patterson's D, fd). Assembly artefacts and sequencing errors are identified by performing a wavelet transform analysis, which locates peaks and valleys in the population genetic signal across the sequence space. The wavelet transform is a relatively new mathematical method similar to the Fourier transform, and it has been used successfully with NGS data in the past. By using population genetic theory, PopSeqle is able to quantify the signal present in multiple sequence alignments caused by evolutionary forces and separate this from the signal caused by errors. Fundamentally, evolutionary forces act on all individuals in the population (and often over considerable timescales). In contrast, errors in the sequence data do not comply with population genetic rules, and this enables us to discriminate between the population genetic signatures of evolutionary forces and errors. In addition, PopSeqle will employ a sliding-window approach to help visualise and identify outlying regions. Following this initial check, the software will direct the user to appropriate further QC checks and/or downstream analyses to investigate the regions of interest. The software algorithms will be evaluated, customised and improved by running them on both simulated and empirical datasets. The ultimate aim of the PopSeqle project is to help improve the quality of NGS data and analyses to the benefit of a wide research community.
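The sliding-window idea described above can be illustrated with a minimal sketch in Python. It computes nucleotide diversity (pi) per window across a toy alignment and flags windows with outlying values. This is not the PopSeqle implementation: the function names, the window size, and the toy sequences are hypothetical, and a simple robust z-score (median and median absolute deviation) stands in for the wavelet transform step.

```python
from itertools import combinations
from statistics import median


def pairwise_diversity(seqs):
    """Mean proportion of differing sites over all pairs of sequences (pi)."""
    pairs = list(combinations(seqs, 2))
    length = len(seqs[0])
    diffs = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return diffs / (len(pairs) * length)


def sliding_pi(alignment, window, step):
    """Return (start, pi) for each full window along the alignment."""
    length = len(alignment[0])
    return [
        (start, pairwise_diversity([s[start:start + window] for s in alignment]))
        for start in range(0, length - window + 1, step)
    ]


def flag_outliers(values, threshold=3.5):
    """Flag values whose robust z-score (median/MAD based) exceeds threshold.

    Assumption: this simple rule replaces the wavelet analysis named in the text.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median: anything different from it is flagged.
        return [v != med for v in values]
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


# Toy alignment (hypothetical data): the last 10 sites of sample 4 are
# deliberately scrambled to mimic a mis-assembled region.
base = "ACGT" * 10
alignment = [
    base,
    base[:5] + "T" + base[6:],
    base[:15] + "A" + base[16:],
    base[:30] + "TACGTACGTA",
]

windows = sliding_pi(alignment, window=10, step=10)
flags = flag_outliers([pi for _, pi in windows])
for (start, pi), flagged in zip(windows, flags):
    print(start, round(pi, 3), "outlier" if flagged else "ok")
```

A flagged window is only a candidate: as the summary notes, an outlying region may be an error or may be of genuine evolutionary interest, so it would still need follow-up checks.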

Planned Impact

The proposed software PopSeqle has great potential to improve the quality of NGS data in the analysis of large datasets consisting of multiple individuals of one or more populations. The rapidly increasing number of de novo genome sequence assemblies enables us to incorporate population genetic theory in the QC checks. The PopSeqle software will set a new benchmark in QC of whole genome population sequence data by incorporating information across multiple de novo genome assemblies. The new software will fit into the existing bioinformatics pipeline between the currently implemented QC tests (based on read depth and other assembly- and sequence-quality statistics) and the downstream genetic analysis. In other words, the proposed project allows us to perform a novel QC step to check the quality of population-based whole genome sequence data. To accomplish this, we propose to incorporate population genetic theory and a wavelet transform analysis. We believe this is a very powerful approach to identify errors in genome assemblies that have hitherto gone unnoticed, such as artefacts due to mis-assembly (which may resemble genuine genome rearrangements). Similarly, highly diverged alleles (in populations with large effective population size, or of genes under balancing selection) may end up being mapped on different scaffolds and, vice versa, copy number variants may be erroneously collapsed. Using the information of multiple de novo assemblies and analysing this in a population genetic framework will significantly improve the quality of NGS data. Over 10 years ago, the PI implemented population genetic algorithms to check genotyping accuracy of SSR loci (microsatellite loci) and developed the software Micro-Checker. That software set a new benchmark in QC of these genetic markers and received more than 5400 citations and 18000 downloads.
As with the launch of Micro-Checker, we believe that the proposed PopSeqle software will fundamentally change the way in which QC of NGS data will be performed in the future.
 
Description We are developing an analysis pipeline that will facilitate the population genetic and evolutionary analyses of whole genome sequence data. We are solving challenges that come with such "big data" analyses using a novel programming language (Julia) that has several advantages over existing languages. Our software has been used by different groups, including two groups working on different plant pathogens (oomycetes), and one group working on human parasites (Cryptosporidium).
We are now happy to report that this has resulted in 2 publications:
Jouet, A., Saunders, D. G., McMullan, M., Ward, B., Furzer, O., Jupe, F., ... van Oosterhout, C., & Jones, J. D. G. (2018). Albugo candida race diversity, ploidy and host-associated microbes revealed using DNA sequence capture on diseased plants in the field. New Phytologist.
Thilliez, G.J., Armstrong, M.R., Lim, T.Y., Baker, K., Jouet, A., Ward, B., Van Oosterhout, C., Jones, J.D., Huitema, E., Birch, P.R. and Hein, I., 2018. Pathogen enrichment sequencing (PenSeq) enables population genomic studies in oomycetes. New Phytologist.
Both papers have now been published.
Exploitation Route The software and IT infrastructure we are developing are already used by others. The analysis pipeline facilitates the population genetic and evolutionary analyses of whole genome sequence data and is now being adopted by other projects. We have set up another GitHub account for this software: https://vanoosterhoutlab.github.io/HybridCheck/
Sectors Agriculture, Food and Drink,Digital/Communication/Information Technologies (including Software),Education,Environment

URL https://github.com/Ward9250
 
Description In 2019, the PI of this grant (Cock Van Oosterhout) gave a popular science lecture at the Norwich Science Festival, explaining to the general public how "Next Generation Sequence" data is being used to address contemporary questions in biology. The PI also gave several presentations and seminars over Zoom in 2020. These presentations featured the software and code developed during the BBSRC grant, and included a presentation for the European Association for Zoos and Aquariums (EAZA) in December 2020 on genetic conservation during captive breeding. The PI has since given various presentations of data analyses and algorithms that build on this BBSRC grant.
First Year Of Impact 2019
Sector Education,Environment
Impact Types Cultural,Societal,Policy & public services

 
Title Analysis of recombination and genetic introgression in whole genome sequence data 
Description I was contacted by a group working in Brazil on the evolutionary genomic analysis of SARS-CoV-2. They thought they had detected a novel recombinant variant of SARS-CoV-2 (a Deltacron). I used some of the technology we developed in PopSeqle to identify and confirm the genetically introgressed region. 
Type Of Material Improvements to research infrastructure 
Year Produced 2023 
Provided To Others? Yes  
Impact It is now much easier to identify hybridisation between different SARS-CoV-2 variants using the methods that we described in our paper. 
URL https://doi.org/10.3390/vaccines11020212
 
Title SpeedDate: Software to calculate divergence times of sequence variation within and between genomes 
Description Estimating the divergence times of alleles or haplotypes (within a single genome), and between loci (in pairs of sequences) can reveal the level of admixture, inbreeding and outbreeding in the population, and it may help to infer the effects of selection on particular loci and/or alleles. Bayesian coalescent-based approaches, however, may be too computationally intensive to process whole genome sequence data. Here we introduce the software SpeedDate to calculate divergence times of DNA (and RNA) sequence data within and across genomes. The software uses a fast algorithm and a sliding window approach, and it calculates divergence times using a JC, K80, F81, HKY, or GTR correction. It produces intuitive graphs to illustrate the variation in divergence times within and across genomes, and it identifies outlier regions that are significantly more conserved or diverged. SpeedDate is a command line application written in the Julia programming language, and it has an optional graphical user interface (GUI). It can analyse aligned sequences and whole genome sequence data in FASTA format of a single (diploid) individual, as well as single or multiple populations of individuals (haploid, diploid or polyploid). SpeedDate is libre software, and both the software and its manual are free to download from http://ward9250.github.io/SpeedDate/
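As an illustration of the distance corrections named above, the following Python sketch applies the standard JC (Jukes-Cantor) and K80 (Kimura two-parameter) formulas to a pair of aligned sequences and converts the corrected distance to a divergence time. It is a sketch under stated assumptions, not SpeedDate's code: the function names are hypothetical, the substitution rate is an arbitrary example value, and the conversion T = d / (2 * rate) assumes a constant rate along both lineages.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_differences(seq1, seq2):
    """Return (sites, transitions, transversions), skipping gaps/ambiguous sites."""
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return sites, transitions, transversions


def jc69_distance(p):
    """Jukes-Cantor distance from the proportion p of differing sites."""
    if p >= 0.75:
        raise ValueError("JC69 distance is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


def k80_distance(p_ts, q_tv):
    """Kimura two-parameter distance from transition and transversion proportions."""
    a = 1 - 2 * p_ts - q_tv
    b = 1 - 2 * q_tv
    if a <= 0 or b <= 0:
        raise ValueError("K80 distance is undefined for these proportions")
    return -0.5 * math.log(a * math.sqrt(b))


def divergence_time(distance, rate):
    """Time since divergence, assuming a constant substitution rate per site per year."""
    return distance / (2 * rate)


# Hypothetical example: two short aligned sequences and an assumed rate.
sites, ts, tv = count_differences("ACGTACGTACGTACGTACGT", "ACGTACGTGCGTACGTACTT")
d_jc = jc69_distance((ts + tv) / sites)
d_k80 = k80_distance(ts / sites, tv / sites)
print(d_jc, d_k80, divergence_time(d_k80, rate=1e-8))
```

The F81, HKY and GTR corrections mentioned in the description need base frequencies or a full rate matrix and are left out of this sketch.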
Type Of Material Computer model/algorithm 
Year Produced 2017 
Provided To Others? Yes  
Impact Analyses using the software are now being conducted by Agathe Jouet (The Sainsbury Lab, Norwich Research Park) on Albugo candida. SpeedDate is also being used to analyse Cryptosporidium evolution in Kevin Tyler's group (UEA).
URL http://ward9250.github.io/SpeedDate/
 
Description Collaboration on a study of the rice blast fungus with The Sainsbury Laboratory and Kobe University (Japan) 
Organisation Kobe University
Country Japan 
Sector Academic/University 
PI Contribution We are conducting computational modelling in this collaboration on a study of the rice blast fungus with The Sainsbury Laboratory.
Collaborator Contribution Partners provided whole genome sequence data of 9 pathogen isolates.
Impact Project is still in progress. Multidisciplinary, involving plant-pathogen specialists, genomics scientists, evolutionary geneticists, and bioinformaticians.
Start Year 2017
 
Description Collaboration on a study of the rice blast fungus with The Sainsbury Laboratory and Kobe University (Japan) 
Organisation University of Cambridge
Department The Sainsbury Laboratory
Country United Kingdom 
Sector Academic/University 
PI Contribution We are conducting computational modelling in this collaboration on a study of the rice blast fungus with The Sainsbury Laboratory.
Collaborator Contribution Partners provided whole genome sequence data of 9 pathogen isolates.
Impact Project is still in progress. Multidisciplinary, involving plant-pathogen specialists, genomics scientists, evolutionary geneticists, and bioinformaticians.
Start Year 2017
 
Description Evolutionary genetic analysis of C. hominis and C. parvum 
Organisation University of East Anglia
Department School of Medicine UEA
Country United Kingdom 
Sector Academic/University 
PI Contribution We are performing population genetic, evolutionary genetic and phylogenetic analyses to understand the evolution of host adaptation in Cryptosporidium. The software SpeedDate is used to estimate divergence times between isolates.
Collaborator Contribution 1) Partner has given us access to a large database of C. hominis and C. parvum whole genome sequences that enables us to develop the software. 2) Students in Tyler's group have helped us to improve the graphical user interface (GUI) of our software.
Impact MS is in preparation
Start Year 2017
 
Description Genome analysis of plant pathogen Albugo candida in collaboration with The Sainsbury Laboratory, NRP 
Organisation University of Cambridge
Department The Sainsbury Laboratory
Country United Kingdom 
Sector Academic/University 
PI Contribution Performed a population genomic analysis of MYbaits targeted genomes of the plant pathogen Albugo candida.
Collaborator Contribution Generated sequence data using MYbaits targeted approach to reveal effectors in the genomes of the plant pathogen Albugo candida.
Impact We have a paper submitted to New Phytologist describing our findings from the population genomic analysis of MYbaits targeted genomes of the plant pathogen Albugo candida. This is a multi-disciplinary study involving bioinformaticians, genomic scientists, evolutionary biologists and ecologists.
Start Year 2017
 
Title SpeedDate is a simple tool for the estimation of coalescence times between sequences created with Julia and Bio.jl 
Description SpeedDate is a simple tool for the estimation of coalescence times between sequences created with Julia and Bio.jl. SpeedDate is designed to take an input file of aligned sequences in FASTA format. For each pair of sequences, it will count the number of mutations between the two sequences, and then compute a coalescence time estimate interval. In order to do this, it must read in sequence files, process DNA sequences and get the number of mutations or genetic distance, compute the coalescence time, and then save the output. SpeedDate wraps this process up in a command line application and an optional graphical user interface (GUI). The software and help files are available at https://github.com/Ward9250/SpeedDate.jl 
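The workflow described above (read aligned FASTA, count mutations for each pair, estimate a time, save the output) can be sketched in Python as follows. This is an illustrative outline, not the SpeedDate code, which is written in Julia: all function names, the column names, and the rate value are hypothetical. The sketch reports only an uncorrected point estimate, T = k / (2 * rate * L), because the source does not say how SpeedDate derives its estimate interval.

```python
import csv
import io
from itertools import combinations


def read_fasta(handle):
    """Parse FASTA text into a dict of name -> sequence (insertion ordered)."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def count_mutations(seq1, seq2):
    """Return (differences, compared sites), ignoring gaps and ambiguity codes."""
    compared = diffs = 0
    for a, b in zip(seq1, seq2):
        if a in "ACGT" and b in "ACGT":
            compared += 1
            diffs += a != b
    return diffs, compared


def pairwise_table(records, rate):
    """One row per sequence pair, with a point estimate of the coalescence time."""
    rows = []
    for (n1, s1), (n2, s2) in combinations(records.items(), 2):
        diffs, compared = count_mutations(s1, s2)
        # Assumed estimator: uncorrected k / (2 * rate * L); no interval is computed.
        time = diffs / (2 * rate * compared) if compared else float("nan")
        rows.append({"seq1": n1, "seq2": n2, "mutations": diffs,
                     "sites": compared, "time": time})
    return rows


def write_table(rows, handle):
    """Save the pairwise results as CSV."""
    fields = ["seq1", "seq2", "mutations", "sites", "time"]
    writer = csv.DictWriter(handle, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical input: three aligned sequences, one of them wrapped over two lines.
fasta_text = ">a\nACGTACGT\nACGT\n>b\nACGTACGA\nAC-T\n>c\nTCGTACGTACGT\n"
records = read_fasta(io.StringIO(fasta_text))
rows = pairwise_table(records, rate=1e-8)
out = io.StringIO()
write_table(rows, out)
print(out.getvalue())
```

In practice the input and output would be files on disk rather than in-memory strings; `io.StringIO` is used here only to keep the example self-contained.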
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact The software is currently being trialled by other research groups at the Norwich Research Park before we write the program note. 
URL https://github.com/Ward9250/SpeedDate.jl
 
Description Lecture on plant pathogens 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Postgraduate students
Results and Impact This forms part of a lecture series for the MSc course and BSc course in Evolutionary Biology & Conservation Genetics organised and presented by CVO at the UEA. Students showed interest in this area of research.
Year(s) Of Engagement Activity 2018
 
Description Population Genetics Group (PGG) meeting Cambridge 2017 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Presented a poster to illustrate and explain the use of the PopSeqle and SpeedDate software, which resulted in discussions on bioinformatics with the audience.
Year(s) Of Engagement Activity 2017
 
Description Seminar at the UEA 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Undergraduate students
Results and Impact Presented bioinformatics tools for analysing big data, in particular sequence databases, using novel software and approaches developed during this project.
Year(s) Of Engagement Activity 2016,2017