Inferring ancestry and relatedness of human genomes using ancient DNA samples' and falls within the EPSRC Artificial Intelligence and Healthcare Techn

Lead Research Organisation: University of Oxford

Abstract

At every genomic position, two individuals are connected through genealogical relationships that lead to a common ancestor. The chronological distance from the individuals to this ancestor is termed the time to the most recent common ancestor (TMRCA). This can be generalized to a set of individuals by representing their genealogical relationships as a tree. Moving along the genome, the topology of the trees can change as the genome is broken up by recombination during meiosis. Hence, the evolutionary history of a set of samples can be compactly represented by a graph, called the ancestral recombination graph (ARG), composed of the individual trees spanning different chunks of the genome.
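As a rough illustration of these definitions, the sketch below stores a toy ARG as a list of local trees, each covering a genomic interval, and reads off the TMRCA of a pair of samples in each tree. The data layout, the sample names, the node times (in arbitrary units such as generations) and the helper names `ancestors`, `tmrca` and `tmrca_along_genome` are all assumptions made for illustration; they are not taken from the project or from any particular ARG library.

```python
def ancestors(node, parent):
    """Return the path from a node up to the root of one local tree, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def tmrca(a, b, parent, time):
    """Time of the most recent common ancestor of samples a and b in one tree."""
    seen = set(ancestors(a, parent))
    # Walking upwards from b, the first node shared with a's lineage is the MRCA.
    for node in ancestors(b, parent):
        if node in seen:
            return time[node]
    raise ValueError("samples share no ancestor in this tree")


# Hypothetical toy ARG: two local trees over adjacent genomic intervals.
# A recombination event between them changes which samples coalesce first.
local_trees = [
    {
        "interval": (0, 1000),
        "parent": {"A": "x", "B": "x", "C": "y", "x": "y"},
        "time": {"A": 0, "B": 0, "C": 0, "x": 50, "y": 200},
    },
    {
        "interval": (1000, 2500),
        "parent": {"B": "x", "C": "x", "A": "y", "x": "y"},
        "time": {"A": 0, "B": 0, "C": 0, "x": 80, "y": 200},
    },
]


def tmrca_along_genome(a, b, trees):
    """List (interval, TMRCA) pairs for samples a and b across the local trees."""
    return [(t["interval"], tmrca(a, b, t["parent"], t["time"])) for t in trees]


print(tmrca_along_genome("A", "B", local_trees))
# [((0, 1000), 50), ((1000, 2500), 200)]
```

In this toy example, samples A and B coalesce recently in the first interval and much further back in the second, which is the kind of change in tree topology along the genome that the ARG records.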

There are multiple computationally intensive methods to reconstruct the ARG from high-quality sequencing data of modern DNA samples. Ancient samples are of degraded quality due to environmental conditions and contamination; usually they can only be sequenced at very low coverage, which makes incorporating them into the ARG of a set of modern samples challenging.

Reconstructing an accurate joint ARG between modern and ancient DNA samples has multiple potential applications that we aim to explore. It can be used for ancestry inference, by recovering the ancestry proportion a modern sample inherits from various ancient ancestral groups, enabling us to reconstruct historical events such as population migrations. We can further exploit ARG topology to detect natural selection, by locating regions of the genome that are unusually shared with certain individuals or ancient groups. Finding regions under positive or negative selection, particularly with known biological functionality, can be especially useful in healthcare-related applications, and such regions have, for example, been leveraged to determine drug targets in pharmaceutical settings. Finally, the phenotypic impact of variants can be evaluated by testing whether ancestry from certain groups is more closely related to certain phenotypes.

The project's first goal is to build a relatedness inference algorithm that can infer tree topology and TMRCAs between modern and ancient samples and use it to reconstruct a joint ARG with data from the UK Biobank and other sources. Long-range chromosomal regions that are shared across pairs of samples are informative for this analysis but hard to detect in low-coverage ancient DNA, so our algorithm will need to implicitly or explicitly model haplotype sharing despite the lack of phasing information, or in the presence of noisy computational phasing.

For this algorithm, we leverage Deep Learning (DL), which has transformed many scientific fields in the past decade. Population genetics has traditionally focused on developing complex parametric models and has not yet significantly benefited from DL advances. Sequencing data has a spatial structure, so sequences from multiple samples can be stacked to form an image and analysed using computer vision approaches (e.g. Convolutional Neural Networks), adjusting for the fact that sample order is irrelevant (i.e. the networks must be exchangeable). As we already have access to an ARG for modern samples, our aim is to explore the use of attention- and graph-based methods to extract ARG information that will help infer ancestry between modern and ancient samples.
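To make the exchangeability requirement concrete, the following is a minimal sketch, not the project's architecture. It applies one shared sliding filter (a stand-in for a convolutional layer) to each sample's row of a stacked genotype matrix and then averages over samples, so that reordering the samples cannot change the output. The function names, the fixed toy kernel and the 0/1 genotype encoding are hypothetical choices made for illustration.

```python
def row_features(genotypes, kernel):
    """Slide a shared 1-D kernel along one sample's row and apply a ReLU."""
    k = len(kernel)
    return [
        max(0.0, sum(w * x for w, x in zip(kernel, genotypes[i:i + k])))
        for i in range(len(genotypes) - k + 1)
    ]


def exchangeable_summary(genotype_matrix, kernel):
    """Apply the same row function to every sample, then average over samples.

    Because the per-sample function is shared and the mean is symmetric in its
    arguments, the result does not depend on the order of the rows.
    """
    per_sample = [row_features(row, kernel) for row in genotype_matrix]
    n = len(per_sample)
    return [sum(column) / n for column in zip(*per_sample)]


# Hypothetical stacked "image": rows are samples, columns are variant sites.
rows = [
    [0, 1, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 1],
]
kernel = [1, -1, 1]  # arbitrary toy filter; a real network would learn this

original = exchangeable_summary(rows, kernel)
reordered = exchangeable_summary([rows[2], rows[0], rows[1]], kernel)
print(original == reordered)  # True
```

The same pattern (a shared per-sample transformation followed by a symmetric pooling operation such as a mean, sum or maximum) is one common way to obtain permutation invariance; attention-based layers without positional information over the sample axis are another.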

Overall, we expect to make two main contributions. The first is algorithmic development that will allow the use of DL for reconstructing joint genealogical trees for modern and ancient DNA samples, tackling quality issues for the latter. The second is joint ARG inference using real UK Biobank and ancient data. We then aim to analyse this ARG to answer questions relating to natural selection and phenotypic impact of having ancestry from certain ancient groups.

Planned Impact

The primary CDT impact will be training 75 PhD graduates as the next generation of leaders in statistics and statistical machine learning. These graduates will lead in industry, government, health care, and academic research. They will bridge the gap between academia and industry, resulting in significant knowledge transfer to both established and start-up companies. Because this cohort will also learn to mentor other researchers, the CDT will ultimately address a UK-wide skills gap. The students will also be crucial in keeping the UK at the forefront of methodological research in statistics and machine learning.
After graduating, students will act as multipliers, educating others in advanced methodology throughout their careers. There is a range of further impacts:
- The CDT has a large number of high-calibre external partners in government, health care, industry and science. These partnerships will catalyse immediate knowledge transfer, bringing cutting-edge methodology to a large number of areas. Knowledge transfer will also be achieved through internships/placements of our students with users of statistics and machine learning.
- Our Women in Mathematics and Statistics summer programme is aimed at students who could go on to apply for a PhD. This programme will inspire the next generation of statisticians and also provide excellent leadership training for the CDT students.
- The students will develop new methodology and theory in the domains of statistics and statistical machine learning. This research will be relevant, addressing the key questions behind real-world problems. It will be published in the best possible statistics journals and machine learning conferences and will be made available online. To maximize reproducibility and replicability, source code and replication files will be made available as open source software or, when relevant to an industrial collaboration, held as a patent or software copyright.

Studentship Projects

Project Reference | Relationship | Related To   | Start      | End        | Student Name
EP/S023151/1      |              |              | 01/04/2019 | 30/09/2027 |
2420820           | Studentship  | EP/S023151/1 | 01/10/2020 | 30/09/2024 | Zoi Tsangalidou