Development of integrated web interfaces for Bioconductor genomic data analysis, annotation and visualization tools
Lead Research Organisation:
European Bioinformatics Institute
Department Name: Microarray Group
Abstract
Genomic data, particularly from microarray expression profiling studies, come in the form of huge matrices of numbers, anywhere from 10,000 to 6,000,000 rows by hundreds to thousands of columns. These data need to be transformed, standardized, visualized, and annotated. The rows of these matrices report the activity (expression levels) of genes under various conditions. The huge data volume, as well as the complexity involved in describing such experimental data, led to the creation of a few major public repositories for array-based high-throughput genomics data: GEO (NCBI, USA) and ArrayExpress (EBI, Cambridge, UK). Our group at the EBI has also developed Expression Profiler (EP), a web-based platform for exploratory data analysis, which can provide some basic insights into the public data in ArrayExpress. The scientific community's work on tools for dealing with such large-scale data has concentrated on a set of open source, command-line driven tools collectively called Bioconductor. These tools, or 'packages', are developed by leaders in specialized areas of application: normalization (mathematical methods of making data from different laboratories comparable), signalling pathway analysis, clustering analysis, meta-analysis, etc., and are therefore the de facto standard for cutting-edge functional genomics analysis. At the same time, by and large the only users of Bioconductor remain sophisticated bioinformaticians, while wet-lab biologists (the experimentalists who produce the actual data) find the learning curve of the R environment too steep, the R language too complex to master, and the details of its command-line flexibility too daunting. Moreover, even within Bioconductor, different packages offer different, often incompatible, paradigms for data input, output and interchange.
There is a clear need to provide easy access to the power of Bioconductor for biologists involved in functional genomics and proteomics experimental research. This project proposes to utilise the EP analytical framework to develop a set of standard web-based interfaces, with a unified look and feel, to core Bioconductor modules, which will also make use of the ArrayExpress database. The proposed system will enable biologists to securely upload their experimental data, analyse them with the best available Bioconductor algorithms, and compare or analyse them together with related public high-throughput data in the repository. The data analysis routines will take advantage of the high-power computing infrastructure available at the EBI, and the results will be stored within the system, accessible from anywhere in the world via a web browser. A further unique advantage is provided by the integration of Bioconductor packages within a set of web interfaces: the interfaces can also be accessed as Web Services, i.e. they can be incorporated in automatic data analysis workflows. In other words, even sophisticated bioinformaticians are likely to find this system useful (see attached letters of support).
Technical Summary
Background: Bioconductor is a loosely organised set of open source tools for analysis, visualization and annotation of diverse types of genomic data. Bioconductor modules are implemented as packages in the command-line based statistical environment R. There exist a few formats for data interchange within Bioconductor (exprSet, MAList, etc.) but no single universal format is yet accepted. The Bioconductor toolbox is widely used by sophisticated bioinformaticians in all areas of genomic data analysis; however, no uniform Bioconductor APIs exist and no standard, single-point-of-contact Web Service interfaces are available. Moreover, wet-lab biologists tend to find the command-line environment difficult to learn and use.
Proposed work: We propose to implement a set of AJAX (Asynchronous JavaScript and XML)-based web interfaces to core Bioconductor components. These interfaces will be implemented within the Expression Profiler (EP) platform, after the necessary modifications and extensions are developed. Users will be able to upload their own data and/or analyse data available in the public microarray repository ArrayExpress. A set of RAD (Rapid Application Development) tools will be developed and distributed openly to the Bioconductor community for quick generation of such interfaces and, as transparently as possible, their integration within the proposed framework. The integrated Bioconductor interfaces will also be available for programmatic access as Web Services, and a system will be developed to keep the APIs up to date with the latest developments in current Bioconductor releases.
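The idea of a single point of contact for heterogeneous analysis routines can be illustrated with a minimal sketch. It is not the project's implementation: the registry, the JSON request shape, and the two toy methods (`center_columns`, `scale_rows`) are all hypothetical, and the project itself targets R-based Bioconductor packages rather than Python functions. The sketch only shows how differently shaped routines might be exposed through one uniform entry point that a web front end or a workflow engine could call.

```python
import json
from typing import Callable, Dict, List

Matrix = List[List[float]]

# Hypothetical registry mapping a public method name to an analysis routine.
# All names here are illustrative and do not correspond to Bioconductor APIs.
_REGISTRY: Dict[str, Callable[..., Matrix]] = {}


def register(name: str):
    """Decorator that exposes a routine under a stable, uniform name."""
    def wrap(fn: Callable[..., Matrix]) -> Callable[..., Matrix]:
        _REGISTRY[name] = fn
        return fn
    return wrap


@register("center_columns")
def center_columns(matrix: Matrix) -> Matrix:
    """Subtract each column's mean (a toy stand-in for a normalization step)."""
    n_rows = len(matrix)
    means = [sum(col) / n_rows for col in zip(*matrix)]
    return [[value - mean for value, mean in zip(row, means)] for row in matrix]


@register("scale_rows")
def scale_rows(matrix: Matrix, factor: float = 1.0) -> Matrix:
    """Multiply every value by a constant factor (a second toy routine)."""
    return [[value * factor for value in row] for row in matrix]


def handle_request(request_json: str) -> str:
    """Single entry point: JSON in, JSON out, regardless of the routine called."""
    request = json.loads(request_json)
    method = request.get("method")
    if method not in _REGISTRY:
        return json.dumps({"status": "error", "message": f"unknown method: {method}"})
    result = _REGISTRY[method](request["matrix"], **request.get("params", {}))
    return json.dumps({"status": "ok", "result": result})
```

A caller would pass a request such as `{"method": "scale_rows", "matrix": [[1.0, 2.0]], "params": {"factor": 2.0}}` to `handle_request` and receive a JSON reply with a `status` field, so the same calling convention covers every registered routine.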
Publications
Beisvåg V
(2011)
Contributions of the EMERALD project to assessing and improving microarray data quality.
in BioTechniques
Goncalves A
(2011)
A pipeline for RNA-seq data processing and quality assessment.
in Bioinformatics (Oxford, England)
Hartley M
(2022)
The BioImage Archive - Building a Home for Life-Sciences Microscopy Data.
in Journal of molecular biology
Kapushesky M
(2012)
Gene Expression Atlas update--a value-added database of microarray and sequencing-based functional genomics experiments.
in Nucleic acids research
Rung J
(2013)
Reuse of public genome-wide gene expression data.
in Nature reviews. Genetics
Rustici G
(2013)
ArrayExpress update--trends in database growth and links to data analysis tools.
in Nucleic acids research
Sarkans U
(2021)
REMBI: Recommended Metadata for Biological Images-enabling reuse of microscopy data in biology.
in Nature methods
Description | Effective parallelisation of R-based computations is possible and can be used in a high-performance computing set-up |
Exploitation Route | This has helped us to process data submitted to the EBI by many external users. We have over 5,000 unique users of the processed data each month |
Sectors | Environment,Pharmaceuticals and Medical Biotechnology |
Description | 1) Routinely used in the EMBL-EBI Functional Genomics Services for various computational production tasks, and 2) used by external users to perform computations on data in the EBI databases. |
First Year Of Impact | 2012 |
Sector | Digital/Communication/Information Technologies (including Software) |
Impact Types | Policy & public services |