Development of a Rapid Processing Pipeline and Graph-based Visualization for the Analysis of Next Generation Sequencing Data
Lead Research Organisation:
University of Edinburgh
Department Name: The Roslin Institute
Abstract
Over the last decade or so there has been an explosion of biological data emanating from new laboratory analysis platforms. These data are increasingly complex and large-scale. DNA sequencing in particular has revolutionized the biomedical and biological sciences. The recent availability of new DNA sequencing platforms means that orders of magnitude more data can be produced relative to what was possible just a few years ago. These advances have further changed the way we think about scientific approaches to basic, applied and clinical research. For example, the ability to sequence the whole genome of many related organisms has allowed large-scale comparative and evolutionary studies to be performed that were until recently unimaginable. Sequencing can also be used to determine which genes are active at any given state or time, by RNA sequencing for gene-expression analyses. In gene-expression studies, RNA sequencing can identify and quantify rare genes without prior knowledge and can provide information regarding sequence variation in the identified genes. When combined with 'pull-down' technologies, these approaches can also answer important questions regarding gene regulation, such as transcription factor or microRNA target binding. These advances in technology, however, come with significant analytical challenges, in particular with respect to the sheer scale of data now being produced. For example, a single run of an Illumina Solexa GA-2 machine produces approximately 100Gb of sequence data. A number of approaches exist for the analysis of these data; however, they are usually slow and extremely computationally intensive, requiring large-memory computers or high-performance computing clusters.
How best to analyse this information is an ongoing and active discussion. One approach to resolving some of these issues is both to develop fast, optimised algorithms for data analysis and to visualise and analyse data as network graphs. This proposal is to develop an optimised system for the analysis of such data. It will involve the development of extremely fast and optimised algorithms for processing the data, for which we have already created prototypes. We will utilise the relatively new field of GPU hardware acceleration to allow these algorithms to run significantly faster when utilising specialised hardware on a consumer 3D graphics card. Data processed through the system will be visualised using a customised 3D visualisation environment designed around the existing BioLayout Express3D system. These sequence graphs have already proved useful in identifying novel sequence elements and aiding the assembly of their consensus sequences, in many cases helping to identify where issues lie. Furthermore, we intend to harness the power of correlation analysis for working with RNA-seq data, providing an integrated solution for moving from primary sequence data through to co-expression analysis of tags-per-gene summaries. In doing so, however, we will also provide network and alignment-based views of the primary data that underpin the summary analyses. This will provide novel ways for users to see their data and how reads interact with each other and the genome itself. The entire system will be modular, and each module will be accessed from a graphical user interface written in Java that gives the user control over analysis modules and allows rapid analysis of large-scale datasets from the primary data to genome/gene level analyses.
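As an illustration of the correlation analysis mentioned above, the following minimal sketch builds co-expression edges from per-gene tag counts. It is not the project's implementation: the function names, the gene names, the counts and the 0.9 threshold are all hypothetical, and Pearson correlation is assumed as the similarity measure.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / sqrt(var_x * var_y)


def coexpression_edges(counts, threshold=0.9):
    """Return (gene1, gene2, r) for every gene pair with r >= threshold.

    `counts` maps a gene name to its tag counts across samples
    (hypothetical input layout, one value per sample).
    """
    edges = []
    for g1, g2 in combinations(sorted(counts), 2):
        r = pearson(counts[g1], counts[g2])
        if r >= threshold:
            edges.append((g1, g2, r))
    return edges


# Hypothetical tag-per-gene summaries over four samples.
counts = {
    "geneA": [10, 20, 30, 40],
    "geneB": [12, 22, 29, 41],
    "geneC": [40, 30, 20, 10],
}
edges = coexpression_edges(counts)
```

Each returned edge can be read as a line between two gene nodes in a network graph; raising the threshold gives a sparser graph.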
Technical Summary
We propose to develop a high-performance system for the processing, analysis and visualization of NGS data. Currently there are many issues associated with NGS data analysis that make these data a significant challenge for most laboratories to deal with. We propose to build a highly optimised data processing system for dealing with these data, based on our extensive experience of computational biology algorithm development and visualization technologies. The system will be modular, with components written in C/C++, Java OpenGL and OpenCL where appropriate. Raw sequencing data will be processed by Reaper, an ultra-fast read processing engine that de-multiplexes sample barcodes and removes adapter contamination, polyA contamination and low-complexity sequence. Reads are then examined for redundancy by the Tally algorithm, which collates sequence data and produces QC metrics for further analysis. Cleaned, processed reads are scanned against the genome to determine their point of origin. We intend to produce a fast parallel mapping tool using the Burrows-Wheeler system and utilising parallel optimisation and hardware acceleration using GPU computation. Annotated reads are produced after mapping, with further QC data. Reads will also be cross-mapped to each other by a GPU hardware-accelerated suffix-array algorithm that allows rapid computation of read-read similarities across a specified locus. These data will be passed to the visualization engine, which allows read-graph topology analysis and custom assembly routines together with further visual QC. Depending on the user requirements, a range of results can be produced, including: transcript expression summaries, differential expression across loci, read-read assemblies and graphs, and tracks for genomic visualization of reads within the IGV tool or the UCSC browser. We believe this combination of high-performance algorithms with visualization and a graphical interface will be of great benefit to the community.
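To make the Burrows-Wheeler mapping step concrete, the sketch below shows exact-match read lookup by backward search over a naive FM-index. It is an illustration only, not the proposed mapping tool: the function names and the toy genome are hypothetical, the index is built by plain suffix sorting rather than any optimised or GPU-accelerated method, and mismatches are not handled.

```python
def build_fm_index(genome):
    """Build a naive FM-index (suffix array, first-column offsets, occurrence table)."""
    text = genome + "$"  # sentinel that sorts before A, C, G, T
    # Suffix array by sorting whole suffixes: simple but slow, for illustration only.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # first[c]: number of characters in the text strictly smaller than c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][k]: number of occurrences of c in bwt[:k].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for k, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][k + 1] = occ[c][k] + (1 if ch == c else 0)
    return sa, first, occ


def find_read(read, index):
    """Return sorted 0-based genome positions where `read` matches exactly."""
    sa, first, occ = index
    lo, hi = 0, len(sa)  # half-open interval of matching suffixes
    for c in reversed(read):  # backward search: extend the match leftwards
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


index = build_fm_index("ACGTACGTTAGC")  # hypothetical toy genome
hits = find_read("ACGT", index)
```

The lookup cost depends on the read length rather than the genome length, which is the property that makes this family of indexes attractive for mapping large numbers of short reads.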
Planned Impact
The advent of NGS technology presents significant challenges in terms of data magnitude and complexity. Tools and techniques that can deal with such data are urgently required. The system for analysis and visualization described in this proposal would be of great assistance to a large number of researchers. Computational tools and techniques have a significant impact on the way biological science is being performed in the post-genomic era. Freely available tools and software allow researchers throughout the UK and worldwide to quickly adapt new technology to their own research goals. In particular, we aim to develop a powerful analysis system that is simple and intuitive to use and minimises the requirement for expert bioinformatics support, thereby helping to bridge the gap between wet and dry research. On an economic and societal level this proposal could have significant secondary benefits, allowing the application of new sequencing technologies to many different biological research problems. Benefits also exist for human and animal health as clinicians and veterinary scientists adopt NGS technologies for diagnostic purposes and to explore population variation and its impact on disease. Likewise, the tools and resources we describe in this proposal will likely be of benefit to the pharmaceutical and agriculture sectors.
Both the Freeman and Enright laboratories have extensive networks of collaborators throughout the UK and beyond, and the research described will support many new and existing collaborations. We envisage that these collaborations will improve both communication and scientific effectiveness across the institutes involved. The Roslin Institute is a world-leading agricultural research institute and The European Bioinformatics Institute is a world-leader in delivering computational tools, resources and research to the international biological community. Both applicants are experts in the delivery of usable tools via the provision of intuitive human interfaces. We will support this through a number of avenues including publications, training and outreach activities, to promote and support the proposed research. The EBI has a well-developed outreach team who promote our resources across the UK and worldwide through online tutorials, local presentations and as part of a travelling roadshow.
The Roslin Institute and the European Molecular Biology Laboratory (EMBL) both have technology transfer offices. Both applicants have been involved in patent applications, technology transfer and company formation previously. Indeed, both were founding members of Fios Genomics Ltd., a data analysis company that has just received significant investment. Where appropriate, commercialisation options will be explored.
Independently, both applicants are actively involved in a range of teaching activities across the United Kingdom and worldwide. This allows us to promote computational techniques, tools and resources to biologists who may not have a strong background in bioinformatics or computational biology. The EBI has a dedicated Industry program that brings together leaders from industry with bioinformaticians. This provides a platform to inform industry of recent advances in research and also to learn the specific needs and requirements of industry in the UK and worldwide. We foresee significant benefits from this research to clinical science. Both applicants have previous successful collaborations with clinical groups, including the University of Cambridge teaching hospital (Addenbrooke's, NHS Foundation Trust) and the Royal Infirmary of Edinburgh (NHS Lothian). It is possible, therefore, that such interactions may produce findings that result in the creation of novel therapeutic or diagnostic procedures with a potential for significant impact on human health.
People |
ORCID iD |
Tom Freeman (Principal Investigator) |
Publications
Nazarie W.F.
(2018)
Visualisation and analysis of RNA-Seq assembly networks
in bioRxiv (under review Nucleic Acids Research)
Wright DW
(2014)
Visualisation of BioPAX Networks using BioLayout Express (3D).
in F1000Research
Livigni A
(2018)
A graphical and computational modeling platform for biological pathways.
in Nature protocols
Nazarie FW
(2019)
Visualization and analysis of RNA-Seq assembly graphs.
in Nucleic acids research
O'Hara L
(2016)
Modelling the Structure and Dynamics of Biological Pathways.
in PLoS biology
Nazarie F
(2018)
Visualisation and analysis of RNA-Seq assembly graphs
Description | This grant funds the development of a new approach to visualising DNA sequencing data. By calculating the similarity between DNA 'reads', the fragments of DNA produced by sequencing machines, it is possible to produce a graphical representation of how each DNA read relates to the others. These graphs can be viewed in our tool BioLayout Express3D, where reads are represented as nodes (spheres) and the relationships between them as lines or edges. Graphs display the structure of DNA assemblies and allow one to see issues or features within them. We have been exploring this approach when applied to RNA-seq data, where the graph's structure helps to resolve mRNA assemblies, identify splice variants and reveal hidden issues with the data itself. The work is currently being further developed as we strive to improve upon the initial implementation of the pipeline, modifying the way graphs are calculated and enabling the same approach to be used for other sequence data, e.g. bacterial genomes. |
Exploitation Route | We hope the approach and the tools we have developed, which are freely available, will be useful to others in their efforts to maximise the information they can derive from their sequencing data. Papers describing the work funded under this grant are in preparation. |
Sectors | Digital/Communication/Information Technologies (including Software),Healthcare,Pharmaceuticals and Medical Biotechnology |
URL | http://seq-graph.roslin.ed.ac.uk/ |
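The read graphs described above, with reads as nodes and similarity relationships as edges, can be sketched as follows. This is a minimal illustration and not the pipeline's method: suffix-prefix overlap is assumed here as a stand-in similarity measure, and the function names, the example reads and the `min_overlap` value are hypothetical.

```python
from itertools import permutations


def overlap_length(a, b, min_overlap):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_overlap` bases exists.
    """
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def build_read_graph(reads, min_overlap=3):
    """Return (nodes, edges); each edge is (source index, target index, overlap)."""
    nodes = list(reads)
    edges = []
    # All ordered pairs: quadratic in the number of reads, fine for a toy example.
    for i, j in permutations(range(len(reads)), 2):
        n = overlap_length(reads[i], reads[j], min_overlap)
        if n:
            edges.append((i, j, n))
    return nodes, edges


# Hypothetical reads that tile a short stretch of sequence.
reads = ["ACGTAC", "TACGGA", "GGATTT"]
nodes, edges = build_read_graph(reads)
```

In this toy case the edges form a simple chain through the three reads. In a graph of real reads, departures from such a chain (branches, loops) are the kind of structure that can point to splice variants or assembly issues.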
Description | We have designed and implemented an analysis pipeline that goes from primary RNA-seq data to RNA assembly graphs (see the NGS Graph Generator resource, http://seq-graph.roslin.ed.ac.uk/). We have also re-engineered various aspects of BioLayout Express3D (now called Graphia Pro) to support the visualisation and analysis of such graphs. This approach allows RNA-seq data to be reconstructed as a network graph in order to help identify issues in sequencing and interpret complex splicing events. A paper describing the work was submitted to a preprint server (https://doi.org/10.1101/409573) and is now being revised following submission to NAR. The work behind this paper also formed the basis of the PhD of student Famhi wan Nazarie. |
First Year Of Impact | 2018 |
Sector | Education,Pharmaceuticals and Medical Biotechnology |
Impact Types | Economic |
Title | Network analysis of short read sequencing data |
Description | The method developed under this grant supports the visualisation and analysis of short read sequencing data. A paper describing this work has been submitted. |
Type Of Material | Improvements to research infrastructure |
Provided To Others? | No |
Impact | None as yet |
URL | http://seq-graph.roslin.ed.ac.uk/ |
Title | BioLayout Express3D |
Description | BioLayout Express3D is a powerful tool for the visualization and analysis of network graphs. Network-based approaches are becoming increasingly popular for the analysis of complex systems of interaction and high dimensional data. Networks can be produced from a wide variety of relationships between entities. In biology this includes the interactions between individuals, disease transmission, sequence similarity, metabolic pathways, protein interactions, pathways, regulatory cascades, gene expression, clinical data. This tool represents the product of over 15 years research and development and uses a combination of high end 3D graphics, algorithms and user-friendly graphical interfaces to allow the user to explore and better analyse their data. |
Type Of Material | Data analysis technique |
Year Produced | 2007 |
Provided To Others? | Yes |
Impact | The website currently receives hits from approx. 700 users a month from around the globe and is used 1500-2000 times a month. This tool has been used in analyses that have contributed to over 40 publications. The technology and know how developed is currently en route to being commercialised and developed further by a new spin out company called Kajeka. |
URL | http://www.biolayout.org/ |
Description | BioLayout Express3D project |
Organisation | EMBL European Bioinformatics Institute (EMBL - EBI) |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | Together we have worked as a team in developing the BioLayout tool, with almost all of the coding performed in Edinburgh. |
Collaborator Contribution | Advice, shared publications, ideas. |
Impact | Outcomes from this work include the tool itself and numerous publications by ourselves and others. The original paper published in PLoS Comp Biol (2007) has been cited 243 times and the Nature Protocols paper (2009) has been cited 239 times (Google Scholar, Feb 2019). This was a multidisciplinary collaboration involving biologists, computer scientists, software engineers and mathematicians. The codebase for this software was licensed from the University of Edinburgh by Kajeka Ltd, a UoE spinout company (www.kajeka.com). |
Start Year | 2006 |
Title | BioLayout Express3D |
Description | This is a network analysis tool and represents the product of over 15 years research and development. It uses a combination of high end 3D graphics, algorithms and user-friendly graphical interfaces to allow the user to explore and better analyse their data. |
IP Reference | |
Protection | Copyrighted (e.g. software) |
Year Protection Granted | 2014 |
Licensed | Yes |
Impact | The website currently receives hits from approx. 700 users a month from around the globe and is used 1500-2000 times a month. This tool has been used in analyses that have contributed to over 40 publications. The technology and know how developed is currently en route to being commercialised and developed further by a new spin out company called Kajeka. |
Title | BioLayout Express3D |
Description | BioLayout Express3D is a powerful tool for the visualization and analysis of network graphs. Network-based approaches are becoming increasingly popular for the analysis of complex systems of interaction and high dimensional data. Networks can be produced from a wide variety of relationships between entities. In biology this includes the interactions between individuals, disease transmission, sequence similarity, metabolic pathways, protein interactions, pathways, regulatory cascades, gene expression, clinical data. This tool represents the product of over 15 years research and development and uses a combination of high end 3D graphics, algorithms and user-friendly graphical interfaces to allow the user to explore and better analyse their data. |
Type Of Technology | Software |
Year Produced | 2007 |
Open Source License? | Yes |
Impact | The website currently receives hits from approx. 700 users a month from around the globe and is used 1500-2000 times a month. This tool has been used in analyses that have contributed to over 40 publications. |
URL | http://www.biolayout.org/ |
Title | Graphia Professional |
Description | This is a network analysis tool designed for the analysis of biological data. It is a commercial product produced by Kajeka Ltd, a company founded on the IP and know-how behind BioLayout Express3D. |
Type Of Technology | Webtool/Application |
Year Produced | 2015 |
Open Source License? | Yes |
Impact | This is the first product of Kajeka Ltd and sales of it will help support the company as it grows. |
URL | https://kajeka.com/graphia-professional/ |
Title | NGS graph generator |
Description | This online tool allows users to generate tanscsequence graphs and then v |
Type Of Technology | Webtool/Application |
Year Produced | 2014 |
Impact | Improved visualisation and detection of splice variants and DNA assemblies. |
URL | http://seq-graph.roslin.ed.ac.uk/ |
Company Name | Kajeka Ltd |
Description | The biological sciences generate vast amounts of data from numerous analytical platforms; the data is big, complex and multi-layered. Like all big data, its analysis and correct interpretation represent a significant challenge. Kajeka produces a data analysis platform for the visualization and analysis of numerical matrices and networks. Our network analysis tools provide rapid analysis pipelines producing visually intuitive network representations of data structure, whatever the source, allowing you to make more effective, more efficient, data-driven decisions quickly. The company offers network analysis tools primarily for the analysis of omics data. The company's first product, Graphia Professional, is based on the IP and know-how behind BioLayout Express3D. The company is now (Feb 2019) about to release a new network analysis platform for the analysis of biological data, called Graphia. |
Year Established | 2014 |
Impact | The company is still at an early stage in its development but is beginning to gain commercial traction and clients for the software. We have won one Scottish Enterprise grant and have submitted a second for the development of our next generation software. We also secured seed investment of £230k last year. We are collaborating with academic scientists who develop network analysis algorithms or who use network analysis in their work. We have a number of ongoing discussions with major pharmaceutical and platform providers. |
Website | http://www.kajeka.com |
Description | Easter Bush Campus Open Day 2018 |
Form Of Engagement Activity | Participation in an open day or visit at my research institution |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Public/other audiences |
Results and Impact | The group ran two stands at this event: one on our work with honey bees, including general interest items about beekeeping and bees, and another displaying the network analysis tools we have developed, initially through BBSRC funding and more latterly as part of the spinout company Kajeka Ltd. Presenters: Dr Mark Barnett and Prof Tom Freeman. |
Year(s) Of Engagement Activity | 2018 |
URL | https://www.ed.ac.uk/roslin/community-engagement/public-events/campus-open-day-2018 |