Novel statistical methods for extracting information from genetic data

Lead Research Organisation: University of Nottingham
Department Name: Sch of Mathematical Sciences

Abstract

This mathematical modelling project combines the analysis of genetic data taken from plants with the development of novel mathematical and statistical tools to extract information from the data.
This project applies ideas from statistical physics and information theory to data on genotype-phenotype variation. In both disciplines, the concepts of entropy and the distribution of microstates are useful in theories that describe macroscopic quantities. We consider a sample from the population, which is then ranked by phenotype (that is, an observable characteristic, such as height or weight). By analysing the resulting order of genotypes (AA, Aa, aa), we derive an effective field which quantifies how strongly one particular single nucleotide polymorphism (SNP, a minor genetic mutation) influences a particular phenotype. This provides an alternative method for analysing genotype-phenotype interactions, which is more powerful than classic genome-wide association studies (GWAS) and does not rely on the statistical assumptions made by Fisher in 1918.
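As an illustration of the ranking step only, the following minimal Python sketch orders a toy sample by phenotype and reads off the resulting genotype sequence. The effective field itself is not specified in this abstract, so the sketch substitutes a much cruder summary, the mean normalised rank of each genotype, purely as a stand-in. The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def ranked_genotypes(genotypes, phenotypes):
    """Return the genotype labels ordered by increasing phenotype value."""
    order = sorted(range(len(phenotypes)), key=lambda i: phenotypes[i])
    return [genotypes[i] for i in order]


def mean_rank_by_genotype(ranked):
    """Mean normalised rank position (between 0 and 1) of each genotype.

    This is a crude stand-in statistic, NOT the effective field of the project.
    """
    n = len(ranked)
    positions = defaultdict(list)
    for rank, genotype in enumerate(ranked):
        positions[genotype].append((rank + 0.5) / n)
    return {g: sum(p) / len(p) for g, p in positions.items()}


# Hypothetical toy sample: genotype at one SNP and a height-like phenotype.
genotypes = ["AA", "Aa", "aa", "Aa", "AA", "aa"]
heights = [12.1, 15.3, 19.8, 16.0, 11.4, 18.2]

ranked = ranked_genotypes(genotypes, heights)
summary = mean_rank_by_genotype(ranked)
print(ranked)
print(summary)
```

In this toy sample the genotypes separate completely in the ranked list; a SNP with no influence on the phenotype would instead be expected to give mean ranks near 0.5 for every genotype.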
In this PhD project, the new method will be applied to Arabidopsis to understand how the genetic states of individuals influence observed phenotypes, such as how ion uptake in Arabidopsis influences plant growth. The data come from ionomic studies performed by Sian Bray, who studies the transport of a range of ions in plants. We will typically be concerned with the three genotype states of a bi-allelic SNP (AA, Aa, aa) and the impact of these states on a continuous range of phenotype values. We will then consider the effects of multiple genes on a single phenotype.
The method quantifies the strength of the genotype-phenotype dependency; whilst many SNPs will have no significant effect on phenotype, some will, and to varying extents. We will investigate linkage disequilibrium, that is, how the strength of interaction varies as one moves along the chromosomes to nearby SNPs. Typically, where there is a SNP with a strong phenotype effect, other SNPs which are close by on the chromosome are also seen to have significant impact on phenotype. We will investigate the range of this interaction, and construct models to explain how rapidly this decays with distance from the most significant SNP.
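One simple candidate for such a decay model is sketched below, assuming (and it is only an assumption, not a result of the project) that the signal strength falls off exponentially with distance from the most significant SNP, s(d) = s0 exp(-d / L). The function name, the units and the synthetic values are hypothetical. The fit is an ordinary least-squares line through the logarithm of the strengths.

```python
import math


def fit_exponential_decay(distances, strengths):
    """Fit strength = s0 * exp(-distance / length_scale) by linear least
    squares on log(strength). All strengths must be strictly positive."""
    logs = [math.log(s) for s in strengths]
    n = len(distances)
    mean_d = sum(distances) / n
    mean_l = sum(logs) / n
    sxx = sum((d - mean_d) ** 2 for d in distances)
    sxy = sum((d - mean_d) * (l - mean_l) for d, l in zip(distances, logs))
    slope = sxy / sxx
    intercept = mean_l - slope * mean_d
    return math.exp(intercept), -1.0 / slope


# Noise-free synthetic check: s0 = 2.0 and a length scale of 10 (arbitrary units).
distances = [0, 5, 10, 20, 40]
strengths = [2.0 * math.exp(-d / 10.0) for d in distances]

s0, length_scale = fit_exponential_decay(distances, strengths)
print(s0, length_scale)
```

Other functional forms, such as a power law, could be compared in the same way; the exponential is used here only because it reduces to a straight-line fit.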
Whilst the ultimate goal is to interpret the plant data and other real data sets, new methods will first be tested on synthetic data to assess their deductive power.
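A minimal sketch of how such synthetic data might be generated is given below. It assumes Hardy-Weinberg genotype frequencies, an additive effect per copy of allele A, and Gaussian noise. These modelling choices, the function name and the parameter values are illustrative assumptions rather than the project's actual test protocol.

```python
import random


def simulate_sample(n, allele_freq, effect, noise_sd, seed=0):
    """Simulate n diploid individuals at one bi-allelic SNP.

    Each individual draws two alleles independently (Hardy-Weinberg assumption);
    the phenotype is effect * (number of A alleles) plus Gaussian noise.
    """
    rng = random.Random(seed)
    labels = ("aa", "Aa", "AA")  # indexed by the number of A alleles
    genotypes, phenotypes = [], []
    for _ in range(n):
        copies = sum(rng.random() < allele_freq for _ in range(2))
        genotypes.append(labels[copies])
        phenotypes.append(effect * copies + rng.gauss(0.0, noise_sd))
    return genotypes, phenotypes


# A SNP with a known effect, and a null SNP with no effect, for comparison.
causal_g, causal_y = simulate_sample(200, allele_freq=0.3, effect=1.5, noise_sd=1.0, seed=1)
null_g, null_y = simulate_sample(200, allele_freq=0.3, effect=0.0, noise_sd=1.0, seed=2)
```

Because the true effect size is known by construction, a method can be scored on how often it flags the causal SNP and how rarely it flags the null one.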
We will investigate generalisations of the model. For example, the phenotypes may only be available as 'binned' data, that is, numbers of each genotype (AA, Aa, aa) in a sequence of ranges (e.g. 10-20, 20-30, 30-40, etc.). In this case the data carry less 'power' and information, so a SNP might be less likely to be flagged as significant; but since the binned data are smoother, they may give a simpler form for the field strength.
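The binned form of the data can be pictured with the short sketch below, which counts each genotype within consecutive phenotype ranges. The half-open convention for the ranges, the function name and the toy values are assumptions made for illustration.

```python
from collections import Counter


def bin_genotype_counts(genotypes, phenotypes, edges):
    """Count each genotype in the half-open ranges [edges[k], edges[k+1]).

    Values outside the outermost edges are ignored.
    """
    counts = [Counter() for _ in range(len(edges) - 1)]
    for genotype, value in zip(genotypes, phenotypes):
        for k in range(len(edges) - 1):
            if edges[k] <= value < edges[k + 1]:
                counts[k][genotype] += 1
                break
    return counts


# Hypothetical toy sample, binned into the ranges 10-20, 20-30 and 30-40.
genotypes = ["AA", "Aa", "aa", "Aa", "AA", "aa"]
phenotypes = [12.1, 15.3, 31.8, 26.0, 11.4, 38.2]
binned = bin_genotype_counts(genotypes, phenotypes, edges=[10, 20, 30, 40])

for low, high, c in zip([10, 20, 30], [20, 30, 40], binned):
    print(f"{low}-{high}: AA={c['AA']} Aa={c['Aa']} aa={c['aa']}")
```

The ordering of individuals inside each range is discarded by this step, which is the sense in which binned data hold less information than the fully ranked list.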
We will also consider multi-allelic systems, that is, where more than two alleles (and hence more than three genotypes) are present. The mathematical analysis of such higher-dimensional systems is significantly more complicated, as the number of possible arrangements of the ranked list becomes increasingly large.
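The growth in the number of arrangements can be made concrete with a standard counting argument, sketched here under the assumptions that individuals sharing a genotype are interchangeable and that there are no ties in phenotype. The symbols N, n_g, p_g, W and k are introduced for illustration and are not taken from the project's own notation.

```latex
% W: number of distinct genotype orderings of a ranked sample of size N,
% where n_g individuals carry genotype g.
W = \frac{N!}{\prod_{g} n_g!}, \qquad \sum_{g} n_g = N .

% Bi-allelic case (three genotypes):
W = \frac{N!}{n_{AA}!\, n_{Aa}!\, n_{aa}!},
\qquad \text{e.g. } N = 6,\ n_{AA} = n_{Aa} = n_{aa} = 2
\ \Rightarrow\ W = \frac{720}{8} = 90 .

% For large N, Stirling's approximation gives an entropy-like form:
\ln W \approx -N \sum_{g} p_g \ln p_g, \qquad p_g = \frac{n_g}{N} .

% A diploid locus with k alleles has k(k+1)/2 unordered genotypes:
% k = 2 gives 3, k = 3 gives 6, k = 4 gives 10.
```

Since the number of genotype classes grows quadratically with the number of alleles, the count of possible ranked lists rises quickly as alleles are added, all else being equal.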

Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
BB/T008369/1                                   01/10/2020  30/09/2028
2744324            Studentship   BB/T008369/1  01/10/2022  30/09/2026