Development of integrated web interfaces for Bioconductor genomic data analysis, annotation and visualization tools

Lead Research Organisation: European Bioinformatics Institute

Department Name: Microarray Group

Abstract

Genomic data, particularly from microarray expression profiling studies, come in the shape of huge matrices of numbers, anywhere from 10,000 to 6,000,000 rows by hundreds to thousands of columns. These data need to be transformed, standardized, visualized, and annotated. The rows of these matrices report the activity (expression levels) of genes under various conditions. The huge data volume, as well as the complexity involved in describing such experimental data, has led to the creation of a few major public repositories for array-based high-throughput genomics data, notably GEO (NCBI, USA) and ArrayExpress (EBI, Cambridge, UK). Our group at the EBI has also developed Expression Profiler (EP), a web-based platform for exploratory data analysis, which can provide some basic insights into the public data in ArrayExpress. The major thrust of the scientific community's work in creating tools for dealing with such large-scale data has been concentrated in a set of open-source, command-line-driven tools collectively called Bioconductor. These tools, or 'packages', are developed by leaders in specialized areas of application: normalization (mathematical methods of making data coming from different laboratories comparable), signalling pathway analysis, clustering analysis, meta-analysis, etc., and are therefore the de facto standard for cutting-edge functional genomics analysis technologies. At the same time, the users of Bioconductor remain, by and large, sophisticated bioinformaticians, while wet-lab biologists (experimentalists who produce the actual data) find the learning curve of the R environment too steep, the R language too complex to master, and the details of its command-line flexibility too daunting. Moreover, even within Bioconductor, different packages offer different, often incompatible, paradigms for dealing with data input, output and interchange.
There is a clear need to provide easy access to the power of Bioconductor for biologists involved in functional genomics and proteomics experimental research. This project proposes to utilise the EP analytical framework to develop a set of standard web-based interfaces, with a unified look and feel, to core Bioconductor modules, which will also make use of the ArrayExpress database. The proposed system will enable biologists to securely upload their experimental data, to analyse them with the best available Bioconductor algorithms, and to compare or analyse them together with related public high-throughput data in the repository. The data analysis routines will take advantage of the high-performance computing infrastructure available at the EBI, and the results will be stored within the system, accessible from anywhere in the world via a web browser. A further unique advantage is provided by the integration of Bioconductor packages within a set of web interfaces: the interfaces can also be accessed as Web Services, i.e. they can be incorporated into automatic data analysis workflows. In other words, even sophisticated bioinformaticians are likely to find this system useful (see attached letters of support).

Technical Summary

Background: Bioconductor is a loosely organised set of open source tools for analysis, visualization and annotation of diverse types of genomic data. Bioconductor modules are implemented as packages in the command-line based statistical environment R. There exist a few formats for data interchange within Bioconductor (exprSet, MAList, etc.), but no single universal format is yet accepted. The Bioconductor toolbox is widely used by sophisticated bioinformaticians in all areas of genomic data analysis; however, no uniform Bioconductor APIs exist, and no standard Web Service interfaces are available through a single point of contact. Moreover, wet-lab biologists tend to find the command-line environment difficult to learn and use.

Proposed work: We propose to implement a set of AJAX (Asynchronous JavaScript and XML)-based web interfaces to core Bioconductor components. These interfaces will be implemented within the Expression Profiler (EP) platform, after the necessary modifications and extensions are developed. Users will be able to upload their own data and/or analyse data available in the public microarray repository ArrayExpress. A set of RAD (Rapid Application Development) tools will be developed and distributed openly to the Bioconductor community for quick generation of such interfaces and, as transparently as possible, their integration within the proposed framework. The integrated Bioconductor interfaces will also be available for programmatic access as Web Services, and a system will be developed to keep the APIs up to date with the latest developments in current Bioconductor releases.

Funded Value:

£90,641

Funded Period:

Nov 06 - Nov 07

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/E001653/1

Principal Investigator:

Alvis Brazma

Research Topic:

Unclassified

Organisations

European Bioinformatics Institute (Lead Research Organisation)

People	ORCID iD
Alvis Brazma (Principal Investigator)
Wolfgang Huber (Co-Investigator)
Misha Kapushesky (Researcher Co-Investigator)

Publications

Beisvåg V (2011) Contributions of the EMERALD project to assessing and improving microarray data quality. in BioTechniques

Goncalves A (2011) A pipeline for RNA-seq data processing and quality assessment. in Bioinformatics (Oxford, England)

Hartley M (2022) The BioImage Archive - Building a Home for Life-Sciences Microscopy Data. in Journal of molecular biology

Kapushesky M (2012) Gene Expression Atlas update--a value-added database of microarray and sequencing-based functional genomics experiments. in Nucleic acids research

Rung J (2013) Reuse of public genome-wide gene expression data. in Nature reviews. Genetics

Rustici G (2013) ArrayExpress update--trends in database growth and links to data analysis tools. in Nucleic acids research

Sarkans U (2021) REMBI: Recommended Metadata for Biological Images-enabling reuse of microscopy data in biology. in Nature methods

Key Findings

Description	Effective parallelisation of R-based computations is possible and can be used in a high-performance computing set-up.
Exploitation Route	This has helped us to process data submitted to the EBI by many external users. We have over 5000 unique users of the processed data monthly.
Sectors	Environment, Pharmaceuticals and Medical Biotechnology

Impact Summary

Description	1) Routinely used in the EMBL-EBI Functional Genomics Services for various computational production tasks, and 2) has external users who perform computations on data in the EBI databases.
First Year Of Impact	2012
Sector	Digital/Communication/Information Technologies (including Software)
Impact Types	Policy & public services
