Development of a Rapid Processing Pipeline and Graph-based Visualization for the Analysis of Next Generation Sequencing Data

Lead Research Organisation: University of Edinburgh
Department Name: The Roslin Institute

Abstract

Over the last decade there has been an explosion of biological data emanating from new laboratory analysis platforms. These data are increasingly complex and large-scale. DNA sequencing in particular has revolutionized the biomedical and biological sciences. The recent availability of new DNA sequencing platforms means that orders of magnitude more data can be produced than was possible just a few years ago. These advances have further changed the way we think about scientific approaches to basic, applied and clinical research. For example, the ability to sequence the whole genome of many related organisms has allowed large-scale comparative and evolutionary studies to be performed that were until recently unimaginable. RNA sequencing can also be used for gene-expression analyses, to determine which genes are active in any given state or at any given time. In gene-expression studies, RNA sequencing can identify and quantify rare genes without prior knowledge and can provide information regarding sequence variation in the identified genes. When combined with 'pull-down' technologies, these approaches can also answer important questions regarding gene regulation, such as transcription factor or microRNA target binding. These advances in technology, however, come with significant analytical challenges, in particular with respect to the sheer scale of data now being produced. For example, a single run of an Illumina Solexa GA-2 machine produces approximately 100Gb of sequence data. A number of approaches exist for the analysis of these data; however, they are usually slow and extremely computationally intensive, requiring large-memory computers or high-performance computing clusters to be effective.

How best to analyse this information is an ongoing and active discussion. One approach to resolving some of these issues is both to develop fast, optimised algorithms for data analysis and to visualise and analyse data as network graphs. This proposal is to develop an optimised system for the analysis of such data. It will involve the development of extremely fast and optimised algorithms for processing the data, for which we have already created prototypes. We will utilise the relatively new field of GPU hardware acceleration to allow these algorithms to run significantly faster on the specialised hardware of a consumer 3D graphics card. Data processed through the system will be visualised using a customised 3D visualisation environment designed around the existing BioLayout Express3D system. These sequence graphs have already proved useful in identifying novel sequence elements and aiding the assembly of their consensus sequences, in many cases helping to identify where issues lie. Furthermore, we intend to harness the power of correlation analysis for working with RNA-seq data, providing an integrated solution for moving from primary sequence data through to co-expression analysis of tags-per-gene summaries. In doing so, however, we will also provide network and alignment based views of the primary data that underpin the summary analyses. This will provide novel ways for users to see their data and how reads interact with each other and the genome itself. The entire system will be modular, and each module will be accessed from a graphical user interface written in Java that gives the user control over analysis modules and allows rapid analysis of large-scale datasets, from the primary data to genome/gene level analyses.
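The co-expression step mentioned above can be illustrated with a minimal sketch. It assumes per-gene tag counts across samples are already available, and it uses Pearson correlation with a fixed threshold to decide which gene pairs become graph edges. The function names, the threshold value and the toy counts are hypothetical and are not taken from the proposal.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / sqrt(var_x * var_y)


def coexpression_edges(counts, threshold=0.9):
    """Return (gene1, gene2, r) for every gene pair with r >= threshold."""
    edges = []
    for g1, g2 in combinations(sorted(counts), 2):
        r = pearson(counts[g1], counts[g2])
        if r >= threshold:
            edges.append((g1, g2, r))
    return edges


# Toy tags-per-gene summaries over four samples (illustrative values only).
counts = {
    "geneA": [10, 20, 30, 40],
    "geneB": [12, 22, 33, 41],
    "geneC": [40, 30, 20, 10],
}
edges = coexpression_edges(counts, threshold=0.9)
```

In this toy input only geneA and geneB rise together across samples, so they are the only pair joined by an edge; the resulting edge list is the kind of input a network viewer could lay out as a graph.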

Technical Summary

We propose to develop a high-performance system for the processing, analysis and visualization of NGS data. Currently there are many issues associated with NGS data analysis that make these data a significant challenge for most laboratories to deal with. We propose to build a highly optimised data processing system for dealing with these data, based on our extensive experience of computational biology algorithm development and visualization technologies. The system will be modular, with components written in C/C++, Java OpenGL and OpenCL where appropriate. Raw sequencing data will be processed by Reaper, an ultra-fast read processing engine that de-multiplexes sample barcodes and removes adapter contamination, polyA contamination and low-complexity sequence. Reads are then examined for redundancy by the Tally algorithm, which collates sequence data and produces QC metrics for further analysis. Cleaned, processed reads are scanned against the genome to determine their point of origin. We intend to produce a fast parallel mapping tool using the Burrows-Wheeler system, with parallel optimisation and hardware acceleration using GPU computation. Annotated reads are produced after mapping, with further QC data. Reads will also be cross-mapped to each other by a GPU hardware accelerated suffix-array algorithm that allows rapid computation of read-read similarities across a specified locus. These data will be passed to the visualization engine, which allows read-graph topology analysis and custom assembly routines, together with further visual QC. Depending on the user requirements, a range of results can be produced, including: transcript expression summaries, differential expression across loci, read-read assemblies and graphs, and tracks for genomic visualization of reads within the IGV tool or the UCSC browser. We believe this combination of high-performance algorithms with visualization and a graphical interface will be of great benefit to the community.
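The first two stages described above (adapter removal, then collapsing redundant reads) can be sketched in a few lines. This is only an illustration of the general idea under simple assumptions: it is not the Reaper or Tally implementation, it uses exact string matching only, and the adapter sequence, function names and reads are hypothetical.

```python
from collections import Counter


def trim_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter: a full copy anywhere, or a partial copy at the end."""
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Look for the longest adapter prefix that the read ends with.
    longest = min(len(adapter), len(read)) - 1
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


def collapse_reads(reads):
    """Count identical sequences so each unique read is kept once with its tally."""
    return Counter(reads)


ADAPTER = "TGGAATTC"  # hypothetical adapter sequence, for illustration only
raw_reads = [
    "ACGTACGTTGGAATTC",  # read followed by the full adapter
    "ACGTACGTTGGA",      # read followed by a partial adapter
    "ACGTACGT",          # read with no adapter
    "GGGTTTAA",          # unrelated read
]
cleaned = [trim_adapter(r, ADAPTER) for r in raw_reads]
tally = collapse_reads(cleaned)
```

After trimming, the first three reads become identical and collapse to one unique sequence with a count of three, which shows why removing contamination before counting redundancy matters. A production tool would also need to tolerate sequencing errors in the adapter, which this exact-match sketch does not.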

Planned Impact

The advent of NGS technology represents significant challenges in terms of data magnitude and complexity. Tools and techniques that can deal with such data are urgently required. The system for analysis and visualization described in this proposal would be of great assistance to a large number of researchers. Computational tools and techniques have a significant impact on the way biological science is being performed in the post-genomic era. Freely available tools and software allow researchers throughout the UK and worldwide to quickly adapt new technology to their own research goals. In particular, our proposal hopes to develop a powerful analysis system which is simple and intuitive to use and will minimise the requirement for expert bioinformatics support, thereby helping to bridge the gap between wet and dry research. On an economic and societal level this proposal could have significant secondary benefits, allowing the application of new sequencing technologies to many different biological research problems. Benefits also exist for human and animal health as clinicians and veterinary scientists adopt NGS technologies for diagnostic purposes and to explore population variation and its impact on disease. Likewise, the tools and resources we describe in this proposal will likely be of benefit to pharmaceutical and agriculture sectors.

Both the Freeman and Enright laboratories have extensive networks of collaborators throughout the UK and beyond, and the research described will support many new and existing collaborations. We envisage that these collaborations will improve both communication and scientific effectiveness across the institutes involved. The Roslin Institute is a world-leading agricultural research institute and The European Bioinformatics Institute is a world-leader in delivering computational tools, resources and research to the international biological community. Both applicants are experts in the delivery of usable tools via the provision of intuitive human interfaces. We will support this through a number of avenues including publications, training and outreach activities, to promote and support the proposed research. The EBI has a well-developed outreach team who promote our resources across the UK and worldwide through online tutorials, local presentations and as part of a travelling roadshow.

The Roslin Institute and the European Molecular Biology Laboratory (EMBL) both have technology transfer offices. Both applicants have been involved in patent applications, technology transfer and company formation previously. Indeed, both were founding members of Fios Genomics Ltd., a data analysis company that has just received significant investment. Where appropriate, commercialisation options will be explored.

Independently, both applicants are actively involved in a range of teaching activities across the United Kingdom and world-wide. This allows us to promote computational techniques, tools and resources to biologists who may not have a strong background in bioinformatics or computational biology. The EBI has a dedicated Industry program that brings together leaders from industry with bioinformaticians. This provides a platform to inform industry of recent advances in research and also to learn the specific needs and requirements of industry in the UK and world-wide. We foresee significant benefits from this research to clinical science. Both applicants have previous successful collaborations with clinical groups including the University of Cambridge teaching hospital (Addenbrookes, NHS Foundation Trust) and Royal Infirmary of Edinburgh (NHS Lothian). It is possible therefore that such interactions may produce findings that result in the creation of novel therapeutic or diagnostic procedures with a potential for significant impact on human health.

Publications

 
Description This grant funds the development of a new approach to visualising DNA sequencing data. By calculating the similarity between DNA 'reads', the fragments of DNA produced by sequencing machines, it is possible to produce a graphical representation of how each DNA read relates to the others. These graphs can be viewed in our tool BioLayout Express3D, where reads are represented as nodes (spheres) and the relationships between them as lines or edges. Graphs display the structure of DNA assemblies and allow one to see issues or features within them. We have been exploring this approach when applied to RNA-seq data, where the graph's structure helps to resolve mRNA assemblies, identify splice variants and reveal hidden issues with the data itself. The work is currently being further developed as we strive to improve upon the initial implementation of the pipeline, modifying the way graphs are calculated and enabling the same approach to be used for other sequence data, e.g. bacterial genomes.
Exploitation Route We hope the approach and the tools we have developed, which are freely available, will be useful to others in their efforts to maximise the information they can derive from their sequencing data. Papers describing the work funded under this grant are in preparation.
Sectors Digital/Communication/Information Technologies (including Software),Healthcare,Pharmaceuticals and Medical Biotechnology

URL http://seq-graph.roslin.ed.ac.uk/
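The idea of a read graph described above can be illustrated with a small sketch. It assumes a simple similarity measure, the Jaccard index of shared k-mers, rather than the alignment-based similarity the actual pipeline may use. The function names, the value of k, the threshold and the reads are all hypothetical.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def read_graph(reads, k=4, min_similarity=0.3):
    """Build edges (read_a, read_b, similarity) between sufficiently similar reads."""
    kmer_sets = {name: kmers(seq, k) for name, seq in reads.items()}
    edges = []
    for a, b in combinations(sorted(reads), 2):
        union = kmer_sets[a] | kmer_sets[b]
        if not union:
            continue  # both reads are shorter than k
        similarity = len(kmer_sets[a] & kmer_sets[b]) / len(union)
        if similarity >= min_similarity:
            edges.append((a, b, similarity))
    return edges


# Toy reads: r1 and r2 overlap, r3 is unrelated.
reads = {
    "r1": "ACGTACGGAT",
    "r2": "GTACGGATCC",
    "r3": "TTTTGGGGCC",
}
graph_edges = read_graph(reads)
```

Here r1 and r2 share five of nine distinct 4-mers and are joined by an edge, while r3 remains an isolated node. Comparing every pair of reads is quadratic in the number of reads, so this sketch is only practical for small loci.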
 
Description We have designed and implemented an analysis pipeline that goes from primary RNA-seq data to RNA assembly graphs (see the NGS Graph Generator resource, http://seq-graph.roslin.ed.ac.uk/). We have also re-engineered various aspects of BioLayout Express3D (now called Graphia Pro) to support the visualisation and analysis of such graphs. This approach allows RNA-seq data to be reconstructed as a network graph in order to help identify issues in sequencing and interpret complex splicing events. A paper describing the work was submitted to a preprint server (https://doi.org/10.1101/409573) and is now being revised following submission to NAR. The work behind this paper also formed the basis of the PhD of student Famhi wan Nazarie.
First Year Of Impact 2018
Sector Education,Pharmaceuticals and Medical Biotechnology
Impact Types Economic

 
Title Network analysis of short read sequencing data 
Description The method developed under this grant supports the visualisation and analysis of short read sequencing data. A paper describing this work has been submitted. 
Type Of Material Improvements to research infrastructure 
Provided To Others? No  
Impact None as yet 
URL http://seq-graph.roslin.ed.ac.uk/
 
Title BioLayout Express3D 
Description BioLayout Express3D is a powerful tool for the visualization and analysis of network graphs. Network-based approaches are becoming increasingly popular for the analysis of complex systems of interaction and high dimensional data. Networks can be produced from a wide variety of relationships between entities. In biology this includes the interactions between individuals, disease transmission, sequence similarity, metabolic pathways, protein interactions, regulatory cascades, gene expression and clinical data. This tool represents the product of over 15 years of research and development and uses a combination of high end 3D graphics, algorithms and user-friendly graphical interfaces to allow the user to explore and better analyse their data. 
Type Of Material Data analysis technique 
Year Produced 2007 
Provided To Others? Yes  
Impact The website currently receives hits from approx. 700 users a month from around the globe and is used 1500-2000 times a month. This tool has been used in analyses that have contributed to over 40 publications. The technology and know-how developed are currently en route to being commercialised and developed further by a new spin-out company called Kajeka. 
URL http://www.biolayout.org/
 
Description BioLayout Express3D project 
Organisation EMBL European Bioinformatics Institute (EMBL - EBI)
Country United Kingdom 
Sector Academic/University 
PI Contribution Together we have worked as a team in developing the BioLayout tool, with almost all of the coding performed in Edinburgh.
Collaborator Contribution Advice, shared publications, ideas.
Impact Outcomes from this work include the tool itself and numerous publications by ourselves and others. The original paper published in PLoS Comp Biol (2007) has been cited 243 times and the Nature Protocols paper (2009) has been cited 239 times (Google Scholar, Feb 2019). This was a multidisciplinary collaboration involving biologists, computer scientists, software engineers and mathematicians. The codebase for this software was licensed from the University of Edinburgh by Kajeka Ltd, a UoE spinout company (www.kajeka.com).
Start Year 2006
 
Title BioLayout Express3D 
Description This is a network analysis tool and represents the product of over 15 years of research and development. It uses a combination of high end 3D graphics, algorithms and user-friendly graphical interfaces to allow the user to explore and better analyse their data. 
IP Reference  
Protection Copyrighted (e.g. software)
Year Protection Granted 2014
Licensed Yes
Impact The website currently receives hits from approx. 700 users a month from around the globe and is used 1500-2000 times a month. This tool has been used in analyses that have contributed to over 40 publications. The technology and know-how developed are currently en route to being commercialised and developed further by a new spin-out company called Kajeka.
 
Title BioLayout Express3D 
Description BioLayout Express3D is a powerful tool for the visualization and analysis of network graphs. Network-based approaches are becoming increasingly popular for the analysis of complex systems of interaction and high dimensional data. Networks can be produced from a wide variety of relationships between entities. In biology this includes the interactions between individuals, disease transmission, sequence similarity, metabolic pathways, protein interactions, regulatory cascades, gene expression and clinical data. This tool represents the product of over 15 years of research and development and uses a combination of high end 3D graphics, algorithms and user-friendly graphical interfaces to allow the user to explore and better analyse their data. 
Type Of Technology Software 
Year Produced 2007 
Open Source License? Yes  
Impact The website currently receives hits from approx. 700 users a month from around the globe and is used 1500-2000 times a month. This tool has been used in analyses that have contributed to over 40 publications. 
URL http://www.biolayout.org/
 
Title Graphia Professional 
Description This is a network analysis tool designed for the analysis of biological data. It is a commercial product produced by Kajeka Ltd, a company founded on the IP and know-how behind BioLayout Express3D. 
Type Of Technology Webtool/Application 
Year Produced 2015 
Open Source License? Yes  
Impact This is the first product of Kajeka Ltd and sales of it will help support the company as it grows. 
URL https://kajeka.com/graphia-professional/
 
Title NGS graph generator 
Description This online tool allows users to generate transcript sequence graphs and then v 
Type Of Technology Webtool/Application 
Year Produced 2014 
Impact Improved visualisation and detection of splice variants and DNA assemblies. 
URL http://seq-graph.roslin.ed.ac.uk/
 
Company Name Kajeka Ltd 
Description The biological sciences generate vast amounts of data from numerous analytical platforms; the data is big, complex and multi-layered. Like all big data, its analysis and correct interpretation represent a significant challenge. Kajeka produces a data analysis platform for the visualization and analysis of numerical matrices and networks. Our network analysis tools provide rapid analysis pipelines producing visually intuitive network representations of data structure, whatever the source, allowing you to make more effective, more efficient, data-driven decisions quickly. The company offers network analysis tools primarily for the analysis of omics data. The company's first product, Graphia Professional, is based on the IP and know-how behind BioLayout Express3D. The company are now (Feb 2019) about to release a new network analysis platform for the analysis of biological data called Graphia. 
Year Established 2014 
Impact The company is still at an early stage in its development but is beginning to gain commercial traction and clients for the software. We have won one Scottish Enterprise grant and have submitted a second for the development of our next generation software. We also secured seed investment of £230k last year. We are collaborating with academic scientists who develop network analysis algorithms or who use network analysis in their work. We have a number of ongoing discussions with major pharmaceutical and platform providers.
Website http://www.kajeka.com
 
Description Easter Bush Campus Open Day 2018 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Public/other audiences
Results and Impact The group ran two stands at this event: one on our work with honey bees, including general interest items about beekeeping and bees, and another displaying the network analysis tools we have developed, initially through BBSRC funding and more latterly as part of the spinout company Kajeka Ltd. Presenters: Dr Mark Barnett and Prof Tom Freeman.
Year(s) Of Engagement Activity 2018
URL https://www.ed.ac.uk/roslin/community-engagement/public-events/campus-open-day-2018