# High-dimensional mixture model selection and alternative splicing

Lead Research Organisation: University of Warwick

Department Name: Statistics

### Abstract

Suppose that you are sitting alone at a table in a crowded restaurant where you can hear a mixture of several conversations. This mixed sound is not particularly interesting, but if you could record it and then use an algorithm to extract the individual conversations going on in the restaurant, the result would certainly be much more informative. In statistical jargon this is called a "mixture model", and there is a surprisingly large number of real-life situations where such models are very useful. As a motivating application we consider an important biomedical problem called alternative splicing. Although all humans have the same genes encoded in our DNA, it turns out that each of our genes can be expressed in several variations (called splicing variants), and that each of these variations performs different functions in the organism; some may even help cause complex neurodegenerative diseases or cancer. Fortunately, recent technologies produce data that allow us, for the first time, to study this phenomenon in detail. We can now observe the overall expression of the gene, from which, as in the restaurant example, we would like to learn the individual contribution of each gene variant (indeed, to learn whether a given variant was present at all). These technologies are becoming cheaper every year, and one can easily envision a near future where they are part of our regular medical check-ups, but solving this mixture problem poses formidable methodological and practical challenges. For instance, the number of possible solutions, even when considering a single gene, is larger than the number of atoms in the universe, and the required calculations can be prohibitive even on the latest computers. This example highlights some of the most important challenges common to many modern applications of mixture models, hence solving them would have positive implications in a much wider range of areas (e.g. technology, industry, public policy, social sciences).
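
To make the "more solutions than atoms in the universe" claim concrete, here is a rough counting sketch. It rests on assumptions that are not stated in the abstract: a gene with a given number of exons, where every non-empty subset of exons is a candidate variant and every non-empty set of variants is a candidate solution, and the commonly quoted order-of-magnitude figure of 10^80 atoms in the observable universe. All function names are hypothetical.

```python
# Assumption: commonly quoted order-of-magnitude estimate, not from the source.
ATOMS_IN_UNIVERSE = 10 ** 80


def count_candidate_variants(n_exons):
    """Number of non-empty subsets of exons (assumed candidate variants)."""
    return 2 ** n_exons - 1


def count_candidate_models(n_exons):
    """Number of non-empty sets of candidate variants (assumed solutions)."""
    return 2 ** count_candidate_variants(n_exons) - 1


def smallest_gene_exceeding(threshold):
    """Smallest exon count whose model count is strictly above the threshold."""
    n_exons = 1
    while count_candidate_models(n_exons) <= threshold:
        n_exons += 1
    return n_exons


if __name__ == "__main__":
    print(count_candidate_variants(3))  # 7
    print(count_candidate_models(3))    # 127
    print(smallest_gene_exceeding(ATOMS_IN_UNIVERSE))  # 9
```

Under these assumptions a gene with only 9 exons already exceeds the threshold, since it admits 2^511 - 1 candidate solutions, which is why exhaustive enumeration is not an option.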

In this project we aim to develop a framework that can be used to solve the alternative splicing and other challenging mixture model problems. Our first goal is to propose a novel formulation for general mixture models that has proven highly successful in other complex settings, studying both theoretical and practical aspects. In our example this formulation says that, when identifying different conversations in the restaurant, we cannot have two tables uttering exactly the same words (else these should be regarded as a single conversation). This apparently simple consideration turns out to have important mathematical consequences that greatly simplify the problem. Our second goal is to apply these general principles to solve the alternative splicing problem, where we will also bring to bear scientific considerations to ensure that the solution is useful in practice. Our third goal is to propose and study strategies to make fast and accurate calculations, which can quickly become prohibitive, so that a computer can find the solution in reasonable time. As part of this project we will provide open-source software that others can use freely for their own research or applied data analysis.
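
The requirement that no two tables utter exactly the same words can be pictured as a penalty that vanishes whenever two mixture components coincide. The sketch below is only an illustration: the function `separation_penalty`, its `scale` argument and the squared-distance form are hypothetical choices, and the project's actual formulation is not specified in this summary.

```python
from itertools import combinations


def separation_penalty(component_means, scale=1.0):
    """Product over all pairs of squared, scaled distances between means.

    Hypothetical illustration: the value is exactly 0 when any two
    components share the same mean, so such solutions are ruled out,
    and it grows as the components move further apart.
    """
    penalty = 1.0
    for mean_a, mean_b in combinations(component_means, 2):
        penalty *= ((mean_a - mean_b) / scale) ** 2
    return penalty


if __name__ == "__main__":
    print(separation_penalty([0.0, 2.0]))       # 4.0
    print(separation_penalty([0.0, 0.0, 3.0]))  # 0.0, two components coincide
```

Multiplying a candidate solution's score by such a term would discard solutions with redundant components, which is one way to read the "single conversation" remark above.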

Given the technical challenges involved, the bulk of the research will be carried out at the Dept. of Statistics at the University of Warwick by the PI, working with other members of the department and several further statistical and biomedical collaborators from prestigious overseas universities and hospitals. These collaborators will be actively involved in the project, e.g. helping translate our methodology to biomedical research and clinical practice, or ensuring that our statistical predictions are indeed accurate.

### Planned Impact

The main beneficiaries of this research are:

- Researchers in statistics and related disciplines working on mixture models and their extensions

- Researchers in bioinformatics and biomedicine working on gene expression

- Data analysts working on applied problems benefiting from mixture models

- On a longer time scale, large-scale genomics consortia, public health and quality of life

The impact on researchers in statistics, bioinformatics and biomedicine is described in Section "Academic beneficiaries". Mixture models are used to tackle a variety of applied data analysis problems in the biological sciences, finance, marketing, house pricing, handwriting recognition and engineering, to mention a few examples. The goals of this project are directly transferable to these problems, so its successful completion would have a longer-term impact by improving the precision of these data analyses and lowering the cost of the computations required to obtain results. Further, our study of ABC methods to integrate complex likelihoods has a potential impact beyond mixture models. Two of the main challenges of Big Data are precisely to process large amounts of data quickly (to make fast decisions) and accurately (to make correct decisions), hence this research lies at the heart of the development of tools required to face modern data analysis challenges. The PI and students involved in the project will acquire data analysis methodology skills easily transferable to other employment sectors.

Our collaborators include a major hospital and a biomedical institute with groups in bioinformatics and cancer genomics laboratories, which helps disseminate our work among the target audience. On a longer time scale, the availability of rigorous methodology to study alternative splicing would benefit large consortia. For instance, the ENCODE and modENCODE projects (the next step following the Human Genome Project) to identify functional elements in the genome, the GTEx portal to characterize gene expression and its regulation across tissues, and The Cancer Genome Atlas, designed to characterize 20 cancers at a molecular level, all hinge on characterizing gene and isoform expression accurately, and hence would benefit from this research. Our project also has longer-term potential for human health and treatment, e.g. by aiding the design of individualized therapies based on isoform expression profiling. In fact, the PI has a track record in personalized medicine projects and patents for colon and liver cancer metastasis. Alternative splicing has been shown to be linked to cancer and neurodegenerative disorders (among other diseases), which are unfortunately among the most prevalent ailments affecting the general population in developed countries; this supports the project's ample potential to improve quality of life and health.

| Field | Details |
| --- | --- |
| Description | We showed that, in the common situation where one analyses a dataset suspected to have originated as a mixture of several underlying signals/subpopulations, one should generally focus on subpopulations that are at least minimally distinct from one another. We showed how to formulate a probability model that ensures this type of behaviour, developed theory and examples to show that the formulation indeed behaves as intended, and applied the findings to a problem in molecular biology related to gene variations (alternative splicing) that is important for understanding many biological processes and diseases. |
| Exploitation Route | Statistical researchers may build upon our basic framework to develop methods for clustering, density estimation or signal deconvolution. Bioinformaticians may use our alternative splicing software as part of their pipelines for analysing molecular biology data, and thus in the longer term help unravel the mechanisms underlying complex processes or diseases. |
| Sectors | Digital/Communication/Information Technologies (including Software); Financial Services, and Management Consultancy; Healthcare; Pharmaceuticals and Medical Biotechnology; Retail |

| Field | Details |
| --- | --- |
| Description | University Hospital Birmingham (Informatics) |
| Organisation | Queen Elizabeth Hospital Birmingham Charity (QEHB) |
| Country | United Kingdom |
| Sector | Charity/Non Profit |
| PI Contribution | Collaborative project between my research group and the medics and members of the informatics team at QEHB. We have provided data science expertise in working with the QEHB informatics data sets. |
| Collaborator Contribution | Data access; expertise with the informatics systems; expert medical knowledge. |
| Impact | Research paper currently under review. This project is multi-disciplinary (statistics, machine learning, data science, medicine, informatics). |
| Start Year | 2015 |

| Field | Details |
| --- | --- |
| Title | Bioconductor package casper |
| Description | The package implements methods to infer alternative splicing from RNA-sequencing data, including the estimation of splicing isoform abundances and model selection methods to discern which isoforms are truly expressed. |
| Type Of Technology | Software |
| Year Produced | 2016 |
| Open Source License? | Yes |
| Impact | The main impact is ensuring that the developed methodology is accessible to researchers in biomedicine and bioinformatics. |
| URL | https://www.bioconductor.org/packages/release/bioc/html/casper.html |
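
As a rough picture of what "estimating isoform abundances" involves, the sketch below runs a textbook EM algorithm for the weights of a mixture whose components are known. It is not the casper implementation or its API: the function `estimate_proportions`, the input layout (one row per sequencing read, one column per candidate isoform, each entry the assumed probability of that read under that isoform) and the toy numbers are all hypothetical.

```python
def estimate_proportions(likelihoods, n_iter=200):
    """EM estimate of mixture weights when component likelihoods are known.

    likelihoods[i][j] is the assumed probability of read i under isoform j.
    Reads with zero probability under every isoform are ignored.
    """
    n_components = len(likelihoods[0])
    weights = [1.0 / n_components] * n_components
    for _ in range(n_iter):
        totals = [0.0] * n_components
        for row in likelihoods:
            # E-step: posterior probability that this read came from each isoform.
            joint = [w * p for w, p in zip(weights, row)]
            norm = sum(joint)
            if norm == 0.0:
                continue
            for j, value in enumerate(joint):
                totals[j] += value / norm
        used = sum(totals)
        if used == 0.0:
            return weights  # no informative reads; keep the current weights
        # M-step: new weights are the average posterior memberships.
        weights = [t / used for t in totals]
    return weights


if __name__ == "__main__":
    # Hypothetical toy data: some reads fit only one isoform, some fit both.
    toy = [[1.0, 0.0]] * 5 + [[0.0, 1.0]] * 2 + [[0.5, 0.5]] * 3
    print(estimate_proportions(toy))
```

This sketch only estimates abundances for a fixed set of isoforms; deciding which isoforms are truly expressed is the separate model selection step mentioned in the package description.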

| Field | Details |
| --- | --- |
| Title | R package twopiece |
| Description | The package implements routines to fit mixture models where the components are allowed to have heavy tails and asymmetry. It also provides functions to evaluate the corresponding density and likelihood function, cluster observations, and plot analysis results, e.g. cluster probabilities. |
| Type Of Technology | Software |
| Year Produced | 2015 |
| Open Source License? | Yes |
| Impact | The main impact is making the developed methodology available to practitioners for data analyses routinely conducted in many scientific fields and industry, e.g. robust clustering analyses or density estimation. |
| URL | https://r-forge.r-project.org/projects/twopiece |
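
For readers unfamiliar with asymmetric mixture components, the sketch below evaluates a two-piece normal density, one standard way to let a component have different spreads on each side of its mode, and a mixture of such components. This is an assumption made for illustration: the record does not state which distributions or parameterization the twopiece package uses, the sketch does not cover heavy tails, and the function names are hypothetical rather than the package's API.

```python
import math


def two_piece_normal_pdf(x, mode, sigma_left, sigma_right):
    """Two-piece normal density: scale sigma_left below the mode,
    sigma_right above it, normalised so the density integrates to one."""
    sigma = sigma_left if x < mode else sigma_right
    z = (x - mode) / sigma
    kernel = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return 2.0 / (sigma_left + sigma_right) * kernel


def mixture_pdf(x, weights, components):
    """Weighted sum of two-piece normal densities.

    components is a list of (mode, sigma_left, sigma_right) tuples and
    weights are assumed to be non-negative and to sum to one.
    """
    return sum(
        weight * two_piece_normal_pdf(x, *params)
        for weight, params in zip(weights, components)
    )


if __name__ == "__main__":
    # Hypothetical two-component mixture, one left-skewed and one right-skewed.
    parts = [(0.0, 2.0, 0.5), (5.0, 0.5, 2.0)]
    print(mixture_pdf(1.0, [0.4, 0.6], parts))
```

With equal left and right scales the density reduces to the usual normal density, so the symmetric model is a special case of this family.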