PopSeqle: Software for Population Sequence data to Lower Errors

Lead Research Organisation: University of East Anglia

Department Name: Environmental Sciences

Abstract

Currently, quality control (QC) checks of NGS data typically rely on the use of a single reference genome. This makes it very difficult to identify errors in sequence and assembly and to distinguish such artefacts from genuine biological phenomena such as genome rearrangements, copy number variants and aneuploidy. To date, no software has been developed for QC of whole genome population sequence data that incorporates population genetics theory to identify errors. The principal aim of this proposal is to develop a fast, user-friendly software platform (PopSeqle) for QC checks of population-based sequence data. PopSeqle uses a population genetic framework to identify potential irregularities in NGS genome / transcriptome assemblies by identifying regions with outlying values for their summary statistics (e.g. pi, FST, Ho). Such regions either represent sequence or assembly errors, or they may represent parts of the genome or transcriptome that are of genuine evolutionary genetic interest. The proposed project is timely as NGS technologies have only recently gained the capability to generate affordable genome-level sequence data from many individuals. PopSeqle will be developed in the new programming language 'Julia', and it will identify errors by performing a relatively novel 'wavelet transform analysis' to locate peaks and valleys in a signal of population genetic summary statistics across the sequence space. Scientists working on genome datasets of plants and plant pathogens will test the beta version of the new software. These researchers will provide feedback which will help us optimise the software algorithms, thereby ensuring stakeholder relevance. Finally, the software, handbook and training video will be uploaded to the TGAC website, and workshops will be organised to demonstrate the PopSeqle software to end-users, thereby promoting staff training potential and increasing value for money.
This project will facilitate new interactions between research staff, postdocs, and PhD students on the NRP and elsewhere who work with NGS data of crops and crop pathogens, thereby enhancing research that is relevant to the BBSRC strategy.

Technical Summary

PopSeqle is a fast, user-friendly software tool to perform quality control (QC) checks of population-based sequence data. Uniquely, the software locates irregular sequence regions in multiple aligned genome assemblies by identifying regions with deviating values for their summary statistics (pi, Fst, Ho, Ts/Tv, dN/dS, Ka/Ks, Patterson's D, fd). Assembly artefacts and sequencing errors are identified by performing a wavelet transform analysis, which locates peaks and valleys in the population genetic signal across the sequence space. The wavelet transform is a relatively new mathematical method similar to the Fourier transform, and it has been used successfully with NGS data in the past. By using population genetic theory, PopSeqle is able to quantify the signal present in multiple sequence alignments caused by evolutionary forces and separate this from the signal caused by errors. Fundamentally, evolutionary forces act on all individuals in the population (and often over considerable timescales). In contrast, errors in the sequence data do not comply with population genetic rules, and this enables us to discriminate between the population genetic signatures of evolutionary forces and errors. In addition, PopSeqle will employ a sliding-window approach to help visualise and identify outlying regions. Following this initial check, the software will direct the user to appropriate further QC checks and/or downstream analyses to investigate the regions of interest. The software algorithms will be evaluated, customised and improved by running both simulated and empirical datasets. The ultimate aim of the PopSeqle project is to help improve the quality of NGS data and analyses to the benefit of a wide research community.
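
The sliding-window and wavelet ideas described in this summary can be illustrated with a minimal Python sketch. It is not the PopSeqle implementation (which the proposal says will be written in Julia); all function names are hypothetical. It assumes an ungapped alignment of at least two equal-length sequences, uses nucleotide diversity (pi) as the only summary statistic, takes single-level Haar detail coefficients as a stand-in for a full wavelet analysis, and flags outliers with a median/MAD robust z-score whose 3.5 cut-off is an assumed convention rather than a value from the project.

```python
from itertools import combinations
from statistics import median


def window_pi(seqs, start, size):
    """Mean number of pairwise differences per site within one alignment window."""
    pairs = list(combinations(seqs, 2))
    diffs = sum(
        sum(a != b for a, b in zip(s1[start:start + size], s2[start:start + size]))
        for s1, s2 in pairs
    )
    return diffs / (len(pairs) * size)


def sliding_pi(seqs, size, step):
    """Pi for each window along the alignment (a trailing partial window is dropped)."""
    length = len(seqs[0])
    return [window_pi(seqs, s, size) for s in range(0, length - size + 1, step)]


def haar_details(signal):
    """Single-level Haar detail coefficients: scaled differences of adjacent pairs.

    Large absolute values mark abrupt changes (peaks or valleys) in the signal.
    """
    return [(signal[i] - signal[i + 1]) / 2 ** 0.5 for i in range(0, len(signal) - 1, 2)]


def flag_outliers(values, threshold=3.5):
    """Indices whose robust z-score (based on median and MAD) exceeds the threshold.

    The threshold is an assumption for illustration, not a PopSeqle setting.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median: anything that differs from it is flagged.
        return [i for i, v in enumerate(values) if v != med]
    return [i for i, v in enumerate(values) if 0.6745 * abs(v - med) / mad > threshold]
```

A typical use would be to compute `sliding_pi(alignment, size, step)`, pass the result to `flag_outliers` to find windows with unusual diversity, and pass it to `haar_details` to find positions where diversity changes abruptly between neighbouring windows. Whether a flagged window is an error or a region of genuine evolutionary interest still requires the follow-up checks described above.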

Planned Impact

The proposed software PopSeqle has great potential to improve the quality of NGS data in the analysis of large datasets consisting of multiple individuals of one or more populations. The rapidly increasing number of de novo genome sequence assemblies enables us to incorporate population genetic theory in the QC checks. The PopSeqle software will set a new benchmark in QC of whole genome population sequence data by incorporating information across multiple de novo genome assemblies. The new software will fit into the existing bioinformatics pipeline between the currently implemented QC tests based on read depth and other assembly- and sequence-quality statistics, and the downstream genetic analysis. In other words, the proposed project allows us to perform a novel QC step to check the quality of population-based whole genome sequence data. To accomplish this, we propose to incorporate population genetic theory and a wavelet transform analysis. We believe this is a very powerful approach to identify errors in genome assemblies that have hitherto gone unnoticed, such as artefacts due to mis-assembly (which may resemble genuine genome rearrangements). Similarly, highly diverged alleles (in populations with large effective population size, or of genes under balancing selection) may end up being mapped on different scaffolds, and vice versa, copy number variants may be erroneously collapsed. Using the information of multiple de novo assemblies and analysing this in a population genetic framework will significantly improve the quality of NGS data. Over 10 years ago, the PI implemented population genetic algorithms to check genotyping accuracy of SSR loci (microsatellite loci) and developed the software Micro-Checker. That software set a new benchmark in QC of these genetic markers and received more than 5400 citations and 18000 downloads.
As with the launch of Micro-Checker, we believe that the proposed PopSeqle software will fundamentally change the way in which QC of NGS data will be performed in the future.

Funded Value:

£151,362

Funded Period:

Sep 16 - Mar 18

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/N02317X/1

Principal Investigator:

Cock Van Oosterhout

Research Subject:

Genetics & development (32%)

Plant & crop science (32%)

Tools, technologies & methods (32%)

Research Topic:

Evolution & populations (32%)

Interaction with organisms (32%)

Theoretical biology (32%)

Organisations

People	ORCID iD
Cock Van Oosterhout (Principal Investigator)
Federica Di Palma (Co-Investigator)
Graham Etherington (Co-Investigator)

Publications


Lighten J (2017) Evolutionary genetics of immunological supertypes reveals two faces of the Red Queen. in Nature communications

Thilliez GJA (2019) Pathogen enrichment sequencing (PenSeq) enables population genomic studies in oomycetes. in The New phytologist

Jouet A (2019) Albugo candida race diversity, ploidy and host-associated microbes revealed using DNA sequence capture on diseased plants in the field. in The New phytologist

Mathers TC (2019) Sex-specific changes in the aphid DNA methylation landscape. in Molecular ecology

Key Findings
Impact Summary
Research Databases and Models
Research Tools and Methods
Collaboration
Software and Technical Products
Engagement Activities


Description	We are developing an analysis pipeline that will facilitate the population genetic and evolutionary analyses of whole genome sequence data. We are solving challenges that come with such "big data" analyses using a novel programming language (Julia) that has several advantages over existing languages. Our software has been used by different groups, including two groups working on different plant pathogens (oomycetes), and one group working on human parasites (Cryptosporidium). This has resulted in two publications, both now published: Jouet, A., Saunders, D. G., McMullan, M., Ward, B., Furzer, O., Jupe, F., ... & van Oosterhout, C., Jones, J. D. G. (2018). Albugo candida race diversity, ploidy and host-associated microbes revealed using DNA sequence capture on diseased plants in the field. New Phytologist. Thilliez, G.J., Armstrong, M.R., Lim, T.Y., Baker, K., Jouet, A., Ward, B., Van Oosterhout, C., Jones, J.D., Huitema, E., Birch, P.R. and Hein, I., 2018. Pathogen enrichment sequencing (PenSeq) enables population genomic studies in oomycetes. New Phytologist.
Exploitation Route	The software and IT infrastructure we are developing are already used by others. The analysis pipeline facilitates the population genetic and evolutionary analyses of whole genome sequence data and is now being adopted by other projects. We have set up another GitHub account for this software: https://vanoosterhoutlab.github.io/HybridCheck/
Sectors	Agriculture, Food and Drink,Digital/Communication/Information Technologies (including Software),Education,Environment
URL	https://github.com/Ward9250


Description	The PI of this grant (Cock Van Oosterhout) gave a popular science lecture at the Norwich Science Festival in 2019, explaining to the general public how "Next Generation Sequence" data is being used to address contemporary questions in biology. The PI has also given several presentations and seminars over Zoom in 2020. These presentations featured the software and code developed during the BBSRC grant, including a presentation for the European Association of Zoos and Aquaria (EAZA) in December 2020 about genetic conservation during captive breeding. The PI has since given various presentations covering data analyses and algorithms that build on this BBSRC grant.
First Year Of Impact	2019
Sector	Education,Environment
Impact Types	Cultural,Societal,Policy & public services


Title	Analysis of recombination and genetic introgression in whole genome sequence data
Description	I was contacted by a group working in Brazil on the evolutionary genomic analysis of SARS-CoV-2. They thought they had detected a novel recombinant variant of SARS-CoV-2 (a Deltacron). I used some of the technology we developed in PopSeqle to identify and confirm the genetically introgressed region.
Type Of Material	Improvements to research infrastructure
Year Produced	2023
Provided To Others?	Yes
Impact	It is now much easier to identify hybridisation between different SARS-CoV-2 variants using the methods that we described in our paper.
URL	https://doi.org/10.3390/vaccines11020212


Title	SpeedDate: Software to calculate divergence times of sequence variation within and between genomes
Description	Estimating the divergence times of alleles or haplotypes (within a single genome), and between loci (in pairs of sequences), can reveal the level of admixture, inbreeding and outbreeding in the population, and it may help to infer the effects of selection on particular loci and/or alleles. Bayesian coalescent-based approaches, however, may be too computationally intensive to process whole genome sequence data. Here we introduce the software SpeedDate to calculate divergence times of DNA (and RNA) sequence data within and across genomes. The software uses a fast algorithm and a sliding window approach, and it calculates divergence times using a JC, K80, F81, HKY, or GTR correction. It produces intuitive graphs to illustrate the variation in divergence times within and across genomes, and it identifies outlier regions that are significantly more conserved or diverged. SpeedDate is a command line application written in the Julia programming language, and it has an optional graphical user interface (GUI). It can analyse aligned sequences and whole genome sequence data in FASTA format of a single (diploid) individual, as well as single or multiple populations of individuals (haploid, diploid or polyploid). SpeedDate is libre software, and both the software and its manual are free to download from http://ward9250.github.io/SpeedDate/
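
To make the distance corrections mentioned above concrete, here is a minimal Python sketch of the two simplest ones, JC (Jukes-Cantor) and K80 (Kimura two-parameter), followed by a conversion from distance to divergence time. It is not SpeedDate's code (SpeedDate is written in Julia); the function names are hypothetical, and the conversion T = d / (2 x rate) assumes a constant substitution rate per site per unit time on both lineages, which is an assumption of this sketch and not a statement about how SpeedDate calibrates its estimates.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_differences(seq1, seq2):
    """Return (sites compared, transitions, transversions), skipping non-ACGT sites."""
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # gaps and ambiguity codes are ignored
        sites += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1
    return sites, transitions, transversions


def jc69_distance(seq1, seq2):
    """Jukes-Cantor distance: d = -3/4 * ln(1 - 4p/3), p = proportion of differing sites."""
    sites, ts, tv = count_differences(seq1, seq2)
    p = (ts + tv) / sites  # raises ZeroDivisionError if no comparable sites
    return -0.75 * math.log(1 - 4 * p / 3)  # raises ValueError when saturated


def k80_distance(seq1, seq2):
    """Kimura two-parameter distance from transition (P) and transversion (Q) proportions."""
    sites, ts, tv = count_differences(seq1, seq2)
    P, Q = ts / sites, tv / sites
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


def divergence_time(distance, rate):
    """Time since divergence, assuming a constant per-site rate on both lineages."""
    return distance / (2 * rate)
```

Applied to successive windows of an alignment, these functions give a per-window divergence time that can then be screened for unusually conserved or diverged regions. The F81, HKY and GTR corrections need base-frequency and rate parameters and are not sketched here.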
Type Of Material	Computer model/algorithm
Year Produced	2017
Provided To Others?	Yes
Impact	Analyses using the software are now being conducted by Agathe Jouet (The Sainsbury Laboratory, Norwich Research Park) on Albugo candida. SpeedDate is also being used to analyse Cryptosporidium evolution in Kevin Tyler's group (UEA).
URL	http://ward9250.github.io/SpeedDate/


Description	Collaboration on a study of the rice blast fungus with The Sainsbury Laboratory and Kobe University (Japan)
Organisation	Kobe University
Country	Japan
Sector	Academic/University
PI Contribution	We are conducting computational modelling in this collaboration on a study of the rice blast fungus with The Sainsbury Laboratory.
Collaborator Contribution	Partners provided whole genome sequence data of 9 pathogen isolates.
Impact	Project is still in progress. Multidisciplinary, involving plant-pathogen specialists, genomics scientists, evolutionary geneticists, and bioinformaticians.
Start Year	2017


Description	Collaboration on a study of the rice blast fungus with The Sainsbury Laboratory and Kobe University (Japan)
Organisation	University of Cambridge
Department	The Sainsbury Laboratory
Country	United Kingdom
Sector	Academic/University
PI Contribution	We are conducting computational modelling in this collaboration on a study of the rice blast fungus with The Sainsbury Laboratory.
Collaborator Contribution	Partners provided whole genome sequence data of 9 pathogen isolates.
Impact	Project is still in progress. Multidisciplinary, involving plant-pathogen specialists, genomics scientists, evolutionary geneticists, and bioinformaticians.
Start Year	2017


Description	Evolutionary genetic analysis of C. hominis and C. parvum
Organisation	University of East Anglia
Department	School of Medicine UEA
Country	United Kingdom
Sector	Academic/University
PI Contribution	We are performing population genetic, evolutionary genetic and phylogenetic analyses to understand the evolution of host adaptation in Cryptosporidium. The software SpeedDate is used to estimate divergence times between isolates.
Collaborator Contribution	1) Partner has given us access to a large database of C. hominis and C. parvum whole genome sequences that enable us to develop the software. 2) Students in Tyler's group have helped us to improve the user interface (GUI) of our software.
Impact	MS is in preparation
Start Year	2017


Description	Genome analysis of plant pathogen Albugo candida in collaboration with The Sainsbury Laboratory, NRP
Organisation	University of Cambridge
Department	The Sainsbury Laboratory
Country	United Kingdom
Sector	Academic/University
PI Contribution	Performed a population genomic analysis of MYbaits targeted genomes of the plant pathogen Albugo candida.
Collaborator Contribution	Generated sequence data using MYbaits targeted approach to reveal effectors in the genomes of the plant pathogen Albugo candida.
Impact	We have a paper submitted to New Phytologist describing our findings from the population genomic analysis of MYbaits targeted genomes of the plant pathogen Albugo candida. This is a multi-disciplinary study involving bioinformaticians, genomic scientists, evolutionary biologists and ecologists.
Start Year	2017


Title	SpeedDate is a simple tool for the estimation of coalescence times between sequences created with Julia and Bio.jl
Description	SpeedDate is a simple tool for the estimation of coalescence times between sequences, created with Julia and Bio.jl. SpeedDate is designed to take an input file of aligned sequences in FASTA format. For each pair of sequences, it counts the number of mutations between the two sequences and then computes a coalescence time estimate interval. To do this, it must read in sequence files, process the DNA sequences to obtain the number of mutations or the genetic distance, compute the coalescence time, and then save the output. SpeedDate wraps this process up in a command line application and an optional graphical user interface (GUI). The software and help files are available at https://github.com/Ward9250/SpeedDate.jl
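
The read, count, compute and save steps described above can be sketched in a few lines of Python. This is an illustration only, not SpeedDate's Julia code: the function names, the output columns and the point estimate t = k / (2 x rate x L), where k is the number of mutations over L compared sites, are assumptions of this sketch. The interval that SpeedDate reports is not reproduced, because its calculation is not described here.

```python
import io
from itertools import combinations


def read_fasta(handle):
    """Parse FASTA records from a text handle into a {name: sequence} dict."""
    records, name = {}, None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def pairwise_coalescence(records, rate):
    """For each pair of aligned sequences, count mutations and estimate a time.

    `rate` is an assumed substitution rate per site per unit time.
    """
    rows = []
    for (n1, s1), (n2, s2) in combinations(records.items(), 2):
        # Only compare sites where both sequences have an unambiguous base.
        compared = [(a, b) for a, b in zip(s1, s2) if a in "ACGT" and b in "ACGT"]
        mutations = sum(a != b for a, b in compared)
        # Both lineages accumulate changes, hence the factor of 2 (assumed model).
        time = mutations / (2 * rate * len(compared)) if compared else float("nan")
        rows.append((n1, n2, mutations, time))
    return rows


def write_tsv(rows, handle):
    """Save the pairwise results as tab-separated text."""
    handle.write("seq1\tseq2\tmutations\ttime\n")
    for n1, n2, mutations, time in rows:
        handle.write(f"{n1}\t{n2}\t{mutations}\t{time:.6g}\n")
```

In use, `read_fasta(open(path))` would supply the records, and `write_tsv(pairwise_coalescence(records, rate), out_handle)` would save one row per sequence pair. For closely related sequences the raw mutation count is a reasonable proxy for genetic distance; for more diverged sequences a substitution-model correction would be applied before converting to time.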
Type Of Technology	Software
Year Produced	2016
Open Source License?	Yes
Impact	The software is currently being trialled by other research groups at the Norwich Research Park before we write the program note.
URL	https://github.com/Ward9250/SpeedDate.jl


Description	Lecture on plant pathogens
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Postgraduate students
Results and Impact	This forms part of a lecture series for the MSc course and BSc course in Evolutionary Biology & Conservation Genetics organised and presented by CVO at the UEA. Students showed interest in this area of research.
Year(s) Of Engagement Activity	2018


Description	Population Genetics Group (PPG) meeting Cambridge 2017
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Postgraduate students
Results and Impact	Presented a poster to illustrate and explain the use of the PopSeqle and SpeedDate software, which resulted in discussions on bioinformatics with the audience.
Year(s) Of Engagement Activity	2017


Description	Seminar at the UEA
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Undergraduate students
Results and Impact	Presenting bioinformatics tools to analyse BIG data, in particular sequence databases, using novel software and approaches developed during this project.
Year(s) Of Engagement Activity	2016,2017
