Visualization Methods for 'Omics type data: Interactive exploration of High Dimensional Data

Lead Research Organisation: Newcastle University
Department Name: Sch of Computer Science

Abstract

Over the years, our ability to investigate biological entities has been improving at an exponential rate. Tasks such as sequencing a human genome (1), which were previously costly and time-consuming, have become significantly more accessible. One of the main causes of this has been the development of next-generation sequencing technologies, which allow for superior data collection at lower cost than older methods.

However, superior data collection doesn't necessarily translate into superior knowledge generation. With significantly larger and more complex datasets, fully exploring and understanding all of the interesting structures within the data has itself become a more time- and resource-intensive task.

These biological datasets tend to be extremely high-dimensional. Within a Microbiomics dataset, an individual species of bacteria equates to a single dimension, and a single sample could contain tens or hundreds of thousands of different bacteria. Most visualization methods can only deal with a moderate number of visual dimensions at once before being limited by screen space and human understanding.

Analysing an entire dataset isn't feasible for a single user to do unaided; conversely, automated methods cannot take advantage of the user's domain knowledge to understand which combinations of dimensions are relevant (2). While there are many visualisation tools available, their diversity can be confusing and often the exact user requirements are not met (3).

This project aims to develop methods that support the reliable and exploratory visual analysis of this extremely high-dimensional biological data. The research will focus on developing interactive and flexible visualisation methods that support users in gaining and sharing knowledge and in generating hypotheses.

The final visualization solution should address the following general criteria:
* Enable visualization of extremely high-dimensional data.
* Enable the identification and exploration of interesting data subsets.
* Support the analysis of different types of 'Omics data, such as Microbiomics, Transcriptomics and Metabolomics.
* Provide visual representation of supporting statistics.

Additionally, the project intends to involve the end users (bioinformaticians and microbiologists) directly throughout all stages of the project. By interacting with the users early on, we will have a much clearer understanding of the challenges that they face and can guide development based on direct user requirements and feedback as the project progresses.

This will be enabled through our collaboration with Unilever, with the project going through a series of regular short placements to meet with their bio-scientists. This will follow a cycle of collaborating with the users, developing methods and designs based on their input, and then iterating the process based on their feedback. The end result will be a fully functional software tool, designed and tested with industry professionals, which will serve both as a demonstration of the research that has been done and as a useable visualisation tool.

References
1. National Human Genome Research Institute. The Cost of Sequencing a Human Genome. National Human Genome Research Institute. [Online] July 6, 2016. https://www.genome.gov/27565109/the-cost-of-sequencing-a-human-genome/.
2. Josua Krause, Aritra Dasgupta, Jean-Daniel Fekete, Enrico Bertini. SeekAView: An Intelligent Dimensionality Reduction Strategy for Navigating High-Dimensional Data Spaces. 2016.
3. Seán I O'Donoghue, Anne-Claude Gavin, Nils Gehlenborg, David S Goodsell, Jean-Karim Hériché, Cydney B Nielsen, Chris North, Arthur J Olson, James B Procter, David W Shattuck, Thomas Walter, Bang Wong. Visualizing biological data - now and in the future. 2010.
