AGILE: A Cloud Approach to Automatic Gene Expression Pattern Recognition and Annotation Over Large-Scale Images

Lead Research Organisation: Manchester Metropolitan University
Department Name: Sch of Computing, Maths and Digital Tech

Abstract

Modern biomedical research makes significant use of large datasets. Cloud computing is emerging as a cost-effective solution by providing virtual computers and storage disks on demand to store and process massive data efficiently without large upfront costs.

Despite some progress, the use of cloud computing in biomedical research is still at a very early stage. There are various concerns about how best to utilise the cloud to accelerate large-scale biomedical applications. In particular: Can a biomedical application be migrated directly to the cloud without modification? How should a cloud-based biomedical application be developed? What are the performance and cost of an application in the cloud, and are they acceptable? Are there optimal methods for keeping both the performance and the overall cost of applications within an acceptable range in the cloud?

This project will develop a cloud approach to a real, data-intensive biomedical task, gene expression pattern recognition and annotation over large-scale image data, and in doing so will address the concerns above. The task was chosen largely for its importance in biomedical research, where intensive data-analysis tasks of this type are increasingly common. It concerns the developmental anatomy of the mouse embryo: it is of great interest to use anatomical annotation to identify gene interactions and networks associated with developmental and physiological functions in the embryo. Gene expression pattern recognition and annotation means labelling embryo images with anatomical terms for mouse development. If an image is tagged with a term, the corresponding anatomical component shows expression of that gene. Currently, this task is mainly carried out manually by domain experts. However, given the vast amount of data available, manual annotation is expensive and time consuming. It is also highly subjective, so labels assigned by different human annotators may be inconsistent across images. To alleviate these issues, we have employed data mining techniques to automatically identify an anatomical component in the embryo image and annotate the image using the provided terms. As this task involves very large-scale images, we intend to exploit cloud computing to address the massive data problems.

It is expected that the successful completion of this project will provide a typical exemplar for accessing and exploiting cloud computing technologies to analyse large-scale image-based biomedical data. An important, and novel, aspect of this proposal is that the major concerns that limit the more widespread use of cloud computing for biomedical applications will be addressed. The theoretical component of the work aims to provide (1) a practical user-friendly biomedical data-mining tool based on the cloud for effective gene expression pattern recognition and annotation and (2) a set of standard services (e.g. image processing algorithms, data mining algorithms) and a novel automatic data reuse mechanism for performance enhancement and cost reduction, which can be reused and plugged into the class of similar biomedical applications.

Technical Summary

Biomedical research is moving to a data-centric paradigm, which requires large computing resources that scale with the increasing volume and complexity of biological data. Cloud computing is emerging as a cost-effective solution for processing and storing massive data on demand. However, a number of challenges limit its widespread use in data-driven biomedical applications. In particular, how should a biomedical application be developed for the cloud? How can its performance be optimised and its overall cost kept within an acceptable range?

With these questions in mind, this project proposes a cloud approach to developing a cloud-based biomedical application tool for efficient gene expression pattern recognition and identification, which involves the use of large-scale images. The project addresses the challenges hindering the wider adoption of cloud computing for biomedical research. The theoretical component of the work aims to 1) develop a user-friendly, cloud-based gene expression pattern recognition and annotation tool, using parallel processing models to enable the application to run on the cloud; and 2) provide a novel data reuse mechanism that automatically determines which data should be deleted or stored, in order to optimise performance and reduce the overall cost of applications, together with a set of standard services that can be reused and plugged into the class of similar applications.

These constitute the main contributions and the novelty of the project.
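The data reuse mechanism is described here only at a high level. As a minimal sketch of the kind of store-or-delete decision it refers to, the Python example below assumes a very simple rule: keep an intermediate dataset when storing it over a planning horizon costs less than regenerating it each time it is needed. The class, function names, prices and the rule itself are hypothetical illustrations, not the project's actual model.

```python
from dataclasses import dataclass


@dataclass
class IntermediateDataset:
    """Hypothetical description of one intermediate result in a workflow."""
    name: str
    size_gb: float           # size of the dataset if it is kept in storage
    recompute_hours: float   # compute time needed to regenerate it once
    expected_reuses: float   # expected number of future uses in the horizon


def storage_cost(ds, price_per_gb_month, months):
    """Cost of keeping the dataset in cloud storage for the whole horizon."""
    return ds.size_gb * price_per_gb_month * months


def recompute_cost(ds, price_per_cpu_hour):
    """Expected cost of regenerating the dataset every time it is needed."""
    return ds.recompute_hours * price_per_cpu_hour * ds.expected_reuses


def should_store(ds, price_per_gb_month=0.02, price_per_cpu_hour=0.10, months=1.0):
    """Keep the dataset only if storing it is cheaper than recomputing it.

    The default prices are made-up placeholders, not real provider tariffs.
    """
    return storage_cost(ds, price_per_gb_month, months) < recompute_cost(ds, price_per_cpu_hour)


if __name__ == "__main__":
    datasets = [
        # Small but expensive to regenerate, reused often.
        IntermediateDataset("extracted_features", size_gb=50, recompute_hours=40, expected_reuses=5),
        # Large but cheap to regenerate, rarely reused.
        IntermediateDataset("resized_images", size_gb=2000, recompute_hours=1, expected_reuses=2),
    ]
    for ds in datasets:
        decision = "store" if should_store(ds) else "delete and recompute on demand"
        print(f"{ds.name}: {decision}")
```

A realistic mechanism would also need to account for data-transfer charges, dependencies between intermediate datasets, and the time penalty of recomputation, all of which this sketch leaves out.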

Planned Impact

The potential beneficiaries of this work fall broadly into two groups: specific users and wider users.

1) Specific users
1.1) Biomedical researchers who use gene-expression data resources such as EMAGE, which hold major genome-wide high-throughput data. These resources represent massive untapped "data-mines" that will be used for decades to extract information using novel image-processing and data-mining algorithms. It is unlikely that existing resource providers will be able to satisfy this demand. This research will propose a cloud approach to develop a practical cloud-based tool for gene expression recognition and annotation. It will also test the feasibility of using cloud computing to facilitate data-intensive biomedical research.
1.2) Researchers in other disciplines, such as the environmental, engineering and physical sciences, who face a "data deluge" problem that requires support from high performance infrastructures, and who would like to use and/or build infrastructures based on cloud computing. This work will inform good practice for these researchers, who can learn from the development of this project to solve domain-specific data-intensive computing problems. To date, a few leading international research institutions such as Cornell and Purdue Universities, CERN and Harvard Medical School have taken steps to move their computing infrastructures into the cloud. The output of this research will contribute to raising the international standing of UK research in the use and exploration of cloud computing as a platform.
1.3) A skilled researcher, fluent in domain-specific applications of cloud computing, will be produced by the unique interdisciplinary nature of this project, which brings together researchers from the biomedical and computer sciences in the context of modern computational biomedical practice.
1.4) University students and lecturers in biomedical sciences can use the tools developed in the project for teaching and learning.
1.5) Information service management teams in universities which build computer infrastructures. One result of the project is the provision of good practice advice for exploiting cloud computing and information in order to manage the increasing complexity of the infrastructures for data intensive applications, and so achieve operational and environmental efficiency.
1.6) Practitioners in industry who design and implement cloud computing. The results of this work will provide feedback to cloud providers for optimization of cloud computing technologies and quality enhancement of cloud services offered by providers.
1.7) Policy makers and research councils. Research councils can use the outcomes as evidence to inform and influence policy-makers so that the policy makers can formulate policies and strategies for future investment and adoption of cloud computing.

2) Wider users
The general public and public sector, who may use cloud services without being aware of it, and who have little knowledge of the cloud or of gene expression annotation. One outcome of the research will be to educate public audiences and raise awareness of cloud computing and gene expression annotation, including their concepts and underlying technologies, by distributing flyers and posters and taking part in outreach activities.

To ensure that these potential beneficiaries have the opportunity to benefit from this research, we will adopt a range of impact activities. These include the dissemination of project deliverables and software through the project website, research publications in prestigious journals (e.g., Bioinformatics, BMC Biology, PLOS ONE) and at appropriate conferences (e.g., IEEE eScience, HPDC), seminars for biomedical researchers and researchers from other disciplines, and public engagement activities for general audiences. The detailed plan is described in the Pathways to Impact document.

Publications

Liangxiu Han (Author) (2013) Automatic Data Reuse for Accelerating Data Intensive Applications in the Cloud in 2013 8th International Conference for Internet Technology and Secured Transactions (ICITST) (IEEE)

Zheng Xie, Liangxiu Han and Richard Baldock (2013) Enhancing Parallelism of Data-Intensive Bioinformatics Applications in EUROSIM '13 Proceedings of the 2013 8th EUROSIM Congress on Modelling and Simulation (IEEE Computer Society)

Zheng Xie, Liangxiu Han and Richard Baldock (2013) Augmented PetriNet Cost Model for Optimisation of Large Bioinformatics Workflows using Cloud in 7th European Modelling Symposium on Mathematical Modelling and Computer Simulation (IEEE)

 
Description Our key findings include:
1) We have generated new knowledge, published in 2 journal papers and 3 conference papers, specifically:
• Development of a cost model for evaluating the cost of running an application in the cloud, and of a data reuse model to accelerate large-scale data handling and processing in the cloud;
• Development and implementation of a cloud-based biomedical application: automatic gene expression pattern recognition and annotation over large-scale image data.
2) Through this project, we have developed and further enhanced our research methods and skills in big data research (such as cloud computing and data analytics approaches).
3) Additionally, based on the findings, we have developed a number of research collaborations with other institutions and organisations in different domains (e.g. food security, energy, city planning).
4) The developed algorithms (e.g. parallel processing and data mining algorithms) will be used in teaching to train data scientists (MSc programme: High performance computing and big data).
Exploitation Route The underlying technologies, such as cloud computing and big data analytics, have been applied to other areas such as food security and agriculture.
Sectors Aerospace, Defence and Marine; Agriculture, Food and Drink; Creative Economy; Digital/Communication/Information Technologies (including Software); Electronics; Energy; Environment; Financial Services, and Management Consultancy; Healthcare; Manufacturing, including Industrial Biotechnology; Culture, Heritage, Museums and Collections; Retail; Security and Diplomacy; Transport

 
Description We have generated impact in several ways, such as organising a big data workshop, disseminating findings through conferences and journal papers, and delivering public lectures and seminars.
1) Through this project, we have developed a number of collaborations and applied the technologies in different domains, such as food security, energy and city planning, which helps foster the economic competitiveness of the UK and beyond as well as enhancing quality of life.
2) This project also provided a solid foundation that allowed us to win new projects in precision agriculture (funded by Agri-Tech China Network+ and British Council Research Institutional Links). We have developed new partnerships with industrial and academic partners from developing countries such as China and Malaysia. The outputs of these projects will contribute to food security both nationally and internationally.
3) We have also delivered a big data workshop, a public professorial lecture and seminars, which raised awareness of big data research.
4) We continue to use the new knowledge arising from this project and to engage with individuals, organisations and nations to address societal challenges and generate wider impact.
First Year Of Impact 2018
Sector Agriculture, Food and Drink; Other
Impact Types Societal

 
Description Agri-Tech in China: Newton Network + (ATCNN) -- Proof-of-Concept Award
Amount £36,808 (GBP)
Organisation Science and Technology Facilities Council (STFC)
Sector Public
Country United Kingdom
Start 04/2017 
End 08/2017
 
Description Agri-Tech in China: Newton Network + (ATCNN) -- Small project award
Amount £45,771 (GBP)
Funding ID QP003 
Organisation Rothamsted Research 
Sector Academic/University
Country United Kingdom
Start 10/2018 
End 03/2019
 
Description COPE automatic diagnosis of crop diseases from images
Amount £49,634 (GBP)
Organisation Imperial College London 
Department Sustainable Society Network+
Sector Academic/University
Country United Kingdom
Start 07/2014 
End 01/2015
 
Description Crop disease detection using computer-aided approaches 
Organisation Fera Science Limited
Country United Kingdom 
Sector Public 
PI Contribution We have initiated and developed a research collaboration with the Food and Environment Research Agency (Fera) on crop disease detection using computer-aided approaches. We are currently applying image processing and data mining/analytics approaches to detect crop diseases (this collaboration has just started and is in progress).
Collaborator Contribution The Food and Environment Research Agency (Fera) has provided samples of crop images and domain knowledge in plant pathology relating to crop diseases.
Impact This collaborative work has just started and is in progress. It is a typical multidisciplinary project that brings plant pathologists and computer scientists together to address societal challenges in food security and agriculture, and it has the potential to generate economic and societal impact.
Start Year 2014
 
Description Enhancing Parallelism of Data-Intensive Bioinformatics Applications 
Organisation Medical Research Council (MRC)
Department MRC Human Genetics Unit
Country United Kingdom 
Sector Academic/University 
PI Contribution Bioinformatics data resources collected from heterogeneous and distributed sources can contain hundreds of terabytes, and the efficient exploration of these large amounts of data is critical to enabling scientists to gain new biological insight. In this work, an MPI-based parallel architecture has been designed to enhance the performance of data-intensive biomedical applications. The experimental results show that the system achieved super-linear speedup and high scalability.
Collaborator Contribution Prof. Richard Baldock and his group have contributed large-scale mouse embryo data sets, domain knowledge of gene expression patterns, and their time to the collaboration.
Impact This is a typical multidisciplinary collaboration and we have produced a number of publications. Please refer to links: 1. http://www.emouseatlas.org/emap/about/publications.html 2. http://www.scmdt.mmu.ac.uk/agile/index.htm
Start Year 2012
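
For readers unfamiliar with the term used in the entry above, "super-linear speedup" means that a run on p processes is more than p times faster than a run on one process. The Python sketch below shows the standard definitions of speedup and parallel efficiency. The timings are invented for illustration and are not the project's measurements.

```python
def speedup(t_serial, t_parallel):
    """Speedup S = T_1 / T_p: how many times faster the parallel run is."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, processes):
    """Parallel efficiency E = S / p; E > 1 indicates super-linear speedup."""
    return speedup(t_serial, t_parallel) / processes


def is_super_linear(t_serial, t_parallel, processes):
    """True when the speedup exceeds the number of processes used."""
    return speedup(t_serial, t_parallel) > processes


if __name__ == "__main__":
    # Hypothetical wall-clock times in seconds (not measured results).
    t1 = 1200.0   # one process
    t8 = 120.0    # eight processes
    print("speedup:", speedup(t1, t8))
    print("efficiency:", efficiency(t1, t8, 8))
    print("super-linear:", is_super_linear(t1, t8, 8))
```

Super-linear behaviour is commonly attributed to memory effects: once the data is partitioned, each process's share may fit in cache or main memory when the full dataset did not.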
 
Description An invited talk at British Library: Large-scale data processing and data analytics on images 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Third sector organisations
Results and Impact An invited talk at the British Library: Large-scale data processing and data analytics on images (with applications to health and food). https://www.eventbrite.co.uk/e/curious-images-tickets-14438270255. Dr. Han gave a talk about her work using image processing, machine learning and parallel processing/cloud computing in the application areas of health/biomedical sciences and food, for example eye disease detection from retinal images, gene expression patterns, and crop disease detection from crop images.
Year(s) Of Engagement Activity 2014
URL https://www.eventbrite.co.uk/e/curious-images-tickets-14438270255
 
Description Big data processing and analytics in the Cloud (Seminar/an invited talk) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact I was invited to give a talk at the University of Kent. The talk attracted attention and prompted discussion.

Potential collaborations
Year(s) Of Engagement Activity 2014
URL http://www.kent.ac.uk/calendar/?eid=4770713F-CDB3-4526-BEFB-77363CF124E1
 
Description Looking to the Cloud (Big data research in the Cloud) -- Press release
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? Yes
Geographic Reach Local
Primary Audience Public/other audiences
Results and Impact This article aimed to share the information about the project and stimulate thinking.

Potential collaborations
Year(s) Of Engagement Activity 2013
URL http://www.sci-eng.mmu.ac.uk/research_matters/winter_2013/winter-2013.pdf
 
Description Public Professorial Lecture: Surfing the data tsunami: A new paradigm for big data processing and analytics 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Public/other audiences
Results and Impact The aim of the lecture was to raise awareness of our research and activities among audiences across the city. The lecture attracted audiences not only from MMU but also from members of the public interested in big data processing and analytics. There were some interesting questions and discussions afterwards.

Potential collaborations
Year(s) Of Engagement Activity 2014
URL http://www.sci-eng.mmu.ac.uk/50/gallery/default.asp?set=set:72157645729032001%3B25%3BProf.%20Lecture...
 
Description The International Workshop on Big Data and Smart Sustainable Society 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact Dr. Han organised an international workshop on "Big Data and Smart Sustainable Society". The workshop raised our research profile and reached a wide audience both nationally and internationally.
Year(s) Of Engagement Activity 2015
URL http://www2.docm.mmu.ac.uk/STAFF/L.Han/BigData-2015/index.htm