Cloud-SPAN: Specialised analyses for environmental 'omics with Cloud-based High Performance Computing

Lead Research Organisation: University of York
Department Name: Biology

Abstract

Environmental Biotechnology (EB) addresses global challenges using engineered microbial systems for environmental protection, bio-remediation and resource recovery. It is a critical and expanding area for the UK and underpins some of the world's most important industries. This is acknowledged by the funding invested in the creation of BBSRC's Networks in Industrial Biotechnology and Bioenergy (NIBBs).
A deep mechanistic understanding of the complex microbial communities involved in the biological cycling of global resources is essential to meet global challenges such as Net Zero, waste management and increased demand. The complexity of these microbiomes can be orders of magnitude larger than those found in the human gut, requiring different approaches to experimental design and analysis with High Performance Computing (HPC). However, EB is an interdisciplinary field that attracts researchers from a broad range of disciplines including Mathematics, Engineering, Biology, Social Sciences, Management, Physics and Chemistry and big data 'omics analyses on HPC systems are often not core skills for such researchers.
This proposal aims to develop and deliver highly accessible resources that will upskill these interdisciplinary researchers so that they are able to generate and analyze big data relating to EB using Cloud HPC. Although infrastructure and resources exist for microbial 'omics (e.g. JGI's IMG/M, MG-RAST, Galaxy, CLIMB, EBI) there is a lack of systematic training tightly linked to the EB domain and documentation is often focussed on technical proficiency rather than contextualised with a strong understanding of experimental design. We will provide foundational training and develop and deliver new advanced modules covering the specialised skills required to generate and analyse 'omics data using Cloud HPC resources. These will include experimental design and statistical modules to ensure researchers can generate data appropriate to investigate their research question. Modules will deploy cloud-based containerised instances provided by Google Education and Amazon Web Services (AWS) for exemplar workflows free to the learner. They will form a complete training resource with fully articulated prerequisites and learning objectives that can be used for in-person or online tutor-led workshops or self-paced learning. Our proposal offers structured Learning Paths from the statistical skills required for robust experimental design through to the reproducible execution and interpretation of 'omics analyses with HPC to cater to researchers with differing levels of previous experience and which allow self-assessment of training needs. We also provide Diversity Scholarships to enable members of underrepresented groups to participate in online or in-person training.
The collaboration between the University of York and the Software Sustainability Institute (SSI) brings together excellence in data science pedagogy and environmental 'omics research with the SSI's UK-leading expertise in research computing and community building. This will ensure the training developed genuinely complements, and aligns with, existing materials to enhance national provision.
Sustainability will be fostered by making the resources Findable, Accessible, Interoperable and Reusable (FAIR), providing cross-platform images for deployment and by developing and proactively engaging with a Community of Practice and providing Code Retreats for the supported practice of methods on participants' own data. In addition, "Cloud Administration Guides" will be developed for institutional HPC Teams to run specialised modules with their own resources. These Guides will be supported by 1-to-2 day training delivered by Cloud-SPAN systems administrators.
The project will be promoted by our partners, the SSI, Google Education, AWS and the N8 Centre of Excellence in Computationally Intensive Research, and through delivery of conference talks and seminars.

Technical Summary

Environmental Biotechnology (EB) is an interdisciplinary field involving advanced molecular and applied microbiologists, environmental chemists and engineers that addresses global challenges using engineered microbial systems. This proposal aims to train these researchers to generate, analyse and mine big data relating to EB microbiomes, which are larger than those found in the human gut and require different approaches to both measurement and analysis in order to manage reagent costs and effectively leverage available HPC resources.
Easily accessible and scalable HPC-based training is required to provide researchers with the skills and self-confidence to manipulate and analyse big data generated from 'omics technologies and generate biological insights from these highly interconnected systems. This area of bioinformatics involves a steep learning curve which can be confounded by the need to install packages with multiple dependencies onto different HPC architectures based on what is available at a researcher's home institution, even before the user can engage with writing scripts to manage workflows, manipulate or visualise data, or manage job schedulers. We will deploy Cloud-based containerised instances which are (1) accessible to anyone anywhere as long as they have an adequate internet connection, (2) have a very low hardware entry requirement and (3) allow for easily scalable and replicable installations of software that will not become deprecated as quickly as might occur on a local server. The cloud providers we are working with run grant schemes that provide significant resources to researchers that will support deployment of production instances of the images we will generate. We expect that our resources will be easily deployed and used by groups who do not necessarily have devoted bioinformaticians or expertise in HPC, providing a cost-effective route to useful analyses for researchers in a strategically key area of expansion in the biosciences.

 
Description Collaboration with Software Sustainability Institute 
Organisation Software Sustainability Institute
Country United Kingdom 
Sector Public 
PI Contribution The project team participates in a quarterly management meeting with Neil Chue Hong, Director of the Software Sustainability Institute.
Collaborator Contribution Neil Chue Hong, Director of the Software Sustainability Institute, participates in a quarterly management meeting to provide guidance and strategic advice on different aspects of the project.
Impact During our meetings we are able to discuss best practice with regard to creating training materials, managing educational activities and engaging the general public.
Start Year 2021
 
Description EBnet Collaboration 
Organisation UK Environmental Biotechnology Network
Country United Kingdom 
Sector Academic/University 
PI Contribution James Chong and Sarah Forrester are active members within the EBnet Working Group. Through their work with EBnet they are able to publicise the Cloud-SPAN project, via sharing information and delivering talks at webinars.
Collaborator Contribution EBnet support and promote the training opportunities created through the Cloud-SPAN project. The collaboration also allows members to exchange expertise in the field of HPC driven microbial genomics research, which in turn improves the quality of the Cloud-SPAN training resources.
Impact James Chong and Sarah Forrester are active members within the EBnet Working Group. Through their work with EBnet they are able to publicise the Cloud-SPAN project, via sharing information and delivering talks at webinars.
Start Year 2021
 
Title Web-based App - Self-Assessment Quiz 
Description Using Shiny, an R package, an online interactive web-based app was created to evaluate a participant's level of competence.
Type Of Technology Webtool/Application 
Year Produced 2022 
Impact This online self-assessment tool has been invaluable for determining participants' level of competence; based on their results, participants can continue their registration for either an introductory course or an advanced course.
URL https://shiny.york.ac.uk/er13/prenomics-quiz/#section-why
 
Description Blog re Genomics Course November 2021 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact A blog was written evaluating the first Genomics Course which was delivered in November 2021. The blog was posted on the Cloud-SPAN forum and included on the SSI's website https://software.ac.uk/news/review-cloud-spans-genomics-course
Year(s) Of Engagement Activity 2021
URL https://cloudspan.peerboard.com/post/1021906833
 
Description Creation of the Cloud-SPAN Handbook 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact In our Cloud-SPAN community we encourage everyone to come together to find solutions to problems and exchange experiences and knowledge. Our aim is to build a friendly and involved community of people who have used our resources, are interested in our resources, or who have expertise in the areas we cover. Ways to contribute include attending one of our courses, asking/answering questions on our community forum and making suggestions for improvements to our courses.

Handbook
This handbook is intended as a reference for both the core Cloud-SPAN team and for our wider community of learners. It's where you'll find our Code of Conduct, contributing guidelines and other practical information which will help you make the most of our resources in a friendly, understanding environment.
Year(s) Of Engagement Activity 2021
URL https://cloud-span.github.io/CloudSPAN-handbook/index.html
 
Description Creation of the Cloud-SPAN LinkedIn Account 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact A Cloud-SPAN account was created to help provide an online presence for the Cloud-SPAN project. LinkedIn is used to promote the different training activities and allows the general public to ask any questions they may have.

Impact: enables the dissemination of the project details to a wider audience and generates registrations to current activities
Year(s) Of Engagement Activity 2021
URL https://www.linkedin.com/company/cloud-span
 
Description Creation of the Cloud-SPAN Twitter Account 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact A Cloud-SPAN Twitter Account was created to help provide an online presence for the Cloud-SPAN project. It also enables the following:
1. Promote the Cloud-SPAN project to an international online audience
2. Promote the registration of various activities
3. Host News stories and blogs
4. Promote information regarding scholarships

Impact: enables the dissemination of the project details to a wider audience and generates registrations to current activities
Year(s) Of Engagement Activity 2021
URL https://twitter.com/SpanCloud
 
Description Creation of the Cloud-SPAN online Forum 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact In our Cloud-SPAN community we encourage everyone to come together to find solutions to problems and exchange experiences and knowledge. Our aim is to build a friendly and involved community of people who have used our resources, are interested in our resources, or who have expertise in the areas we cover. Ways to contribute include attending one of our courses, asking/answering questions on our community forum and making suggestions for improvements to our courses.

The Cloud-SPAN forum is a place to ask questions, pick people's brains and share any insights you've gained during or after one of our courses. It will be the main hub of the Cloud-SPAN community of practice. We strongly encourage you to engage with the Cloud-SPAN community to enhance your learning and understanding.

Impact: enables the dissemination of the project details to a wider audience and generates registrations to current activities. It also provides an opportunity for the audience to learn and develop their knowledge and skills.
Year(s) Of Engagement Activity 2021
URL https://cloudspan.peerboard.com/
 
Description Creation of the Cloud-SPAN website 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact A Cloud-SPAN website was created to help provide an online presence for the Cloud-SPAN project. It also enables the following:
1. Promote the Cloud-SPAN project to an international online audience
2. Promote and organise registration of various activities
3. Host News stories and blogs
4. Promote information regarding scholarships
5. Provide a platform for individuals to ask for further information or ask any questions

Impact: enables the dissemination of the project details to a wider audience and generates registrations to current activities
Year(s) Of Engagement Activity 2021
URL https://cloud-span.york.ac.uk/
 
Description EBNet Webinar: Using Big Data Approaches to Understand Microbial Communities 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact EBNet Webinar: Using Big Data Approaches to Understand Microbial Communities
Thursday, 10th February 2022 at 13.00 - 14.15.
The SESSION RECORDING is available here https://www.youtube.com/watch?v=1QH0JK0X0Xw

EBNet are hosting a series of specialist webinars to support knowledge exchange amongst members, including "Using Big Data Approaches to Understand Microbial Communities". Hear the latest developments from top speakers and participate in the online chat to engage with questions.

This fascinating session is brought to you by the Chairs: Dr Sarah Forrester, the Chong Group, Dept. of Biology, University of York & Dr Bing Guo.

Dr Sarah Forrester is a PDRA in James Chong's group in the Department of Biology at the University of York. She gained her PhD at the University of Liverpool in 2016 using multi 'omic approaches to analyse parasite genomic data, and has worked since then on a range of microbial systems and used a variety of bioinformatic methods. She performs HPC driven microbial genomics research and delivers bioinformatics training. As a 2022 Software Sustainability fellow and a certified Software Carpentry instructor, she is passionate about instilling good bioinformatic practices into her training. She is also involved in the preparation and delivery of the material for Cloud-SPAN: Specialised analyses for environmental 'omics with Cloud-based High Performance Computing, see https://cloud-span.york.ac.uk/.

TALK TITLE: INTRODUCTION TO THE EBNET BIOINFORMATICS WORKING GROUP
Prof James Chong is a Royal Society Industry Fellow and Professor in the Department of Biology at the University of York, where he runs a research group exploiting a range of 'omics techniques to understand microbial community dynamics, as well as leading the EBNet Working Group "Bioinformatics Training for Microbial Environmental Biotechnologies". His group is involved in generating microbial community metagenomics, meta-transcriptomics and metabolomics datasets. His group use established analytical pipelines, but also develop their own bespoke scripts for data analysis. Insight into the application of 'omics techniques, and the ways in which they can be applied to environmental biotechnology use cases to better understand microbial community dynamics, has driven his desire to develop bioinformatic training resources. This is currently being supported by the UKRI Grant Cloud-SPAN: Specialised analyses for environmental 'omics with Cloud-based High Performance Computing, which is co-led by James, see https://cloud-span.york.ac.uk/.

Impact: enables the dissemination of the project details to a wider audience and generates registrations to current activities
Year(s) Of Engagement Activity 2022
URL https://ebnet.ac.uk/ebnet-rc22-bigdata/
 
Description EBNet Working Group Coordinator 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact James Chong is the Working Group Chair for EBnet. This WG aims to create Bioinformatics training for microbial Environmental Biotechnologies. In this role James is able to make new connections and publicise the work of Cloud-SPAN.
Year(s) Of Engagement Activity 2021
URL https://ebnet.ac.uk/about/wg-details/wg-bioinformatics/
 
Description Online Training Course: Genomics November 2021 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact The online training course on Genomics was delivered by the following members of the project team: Emma Rand, Jorge Buenabad-Chavez, Sarah Forrester, Evelyn Greeves and Annabel Cansdale. The course was delivered over 4 half days to 26 UK-based participants.

Expected learning outcomes - by the end of the training course participants were able to:
• structure their data and metadata and plan for an NGS project
• organise and document genomics data and bioinformatics workflows
• understand what information is needed by a sequencing facility
• gain practice navigating file systems, creating, copying, moving, and removing files and directories
• use command-line tools to assess read quality and perform quality control
• align reads to a reference genome, and identify and visualise sequence variants
• work with Amazon AWS cloud computing and transfer data between a local computer and cloud resources
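
As an illustration of what "assessing read quality" in the outcomes above involves, here is a minimal Python sketch. It is not taken from the course materials, which use command-line tools; it assumes the standard Phred+33 encoding of FASTQ quality strings, and the function names and the mean-quality threshold of 20 are hypothetical choices made for this example.

```python
def phred_scores(quality_line: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into per-base Phred scores.

    Assumes Phred+33 encoding: each character's ASCII code minus 33.
    """
    return [ord(ch) - offset for ch in quality_line]


def mean_quality(quality_line: str) -> float:
    """Return the mean Phred score of a read (0.0 for an empty read)."""
    scores = phred_scores(quality_line)
    if not scores:
        return 0.0
    return sum(scores) / len(scores)


def passes_qc(quality_line: str, min_mean: float = 20.0) -> bool:
    """Hypothetical filter: keep reads whose mean quality meets the threshold."""
    return mean_quality(quality_line) >= min_mean


if __name__ == "__main__":
    # A made-up FASTQ record: header, bases, separator, quality string.
    record = ["@example_read", "ACGT", "+", "II5#"]
    quality = record[3]
    print(phred_scores(quality))
    print(passes_qc(quality))
```

A Phred score Q corresponds to an estimated base-calling error probability of 10^(-Q/10), so a score of 20 means roughly a 1 in 100 chance that the base is wrong.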

Feedback from participants was very positive and many stated that they felt their abilities had improved after attending the course, as highlighted in this blog post.

Impact: provides an opportunity for the learner to develop their knowledge and skills.
Year(s) Of Engagement Activity 2021
URL https://cloud-span.github.io/genomics01-intro/
 
Description Online Training Course: Prenomics March 2022 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact The online training course on Prenomics was delivered by the following members of the project team: Emma Rand, Jorge Buenabad-Chavez and Evelyn Greeves. The course was delivered over 2 half days to 28 UK-based participants.

The Prenomics module is designed to prepare people for the Cloud-SPAN Genomics module. We have found that people taking the Genomics module can vary in the amount of experience they have had in navigating file systems and using the command line. We have designed the Prenomics module to allow more time for those with less experience to cover some foundation concepts. We have a Self-assessment Quiz to help you decide if you would benefit from attending Prenomics before the Genomics module. The Prenomics and Genomics modules are based on Data Carpentry's Genomics Workshop. Prenomics teaches the basics of command-line programming, including: (1) file directory structure, (2) use of command-line utilities to connect to and use cloud computing and storage resources and (3) basic shell commands for file navigation and basic script writing.

Impact: allows participants to develop their skills and knowledge in this area.
Year(s) Of Engagement Activity 2022
URL https://cloud-span.github.io/prenomics00-intro/
 
Description Participant on Open Life Science Mentorship Programme 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Team member Evelyn Greeves is a participant on the OLS mentorship programme. This allows her to widen her expertise in the area of establishing and maintaining an online community.

Impact: via the OLS network the work of Cloud-SPAN can be publicised.
Year(s) Of Engagement Activity 2022
URL https://openlifesci.org/ols-5/projects-participants/
 
Description Presentation at University of York's Head of Department Meeting 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact Project Leads Emma Rand and James Chong delivered an informative presentation giving an overview of the project, including its goals, strategy and training resources. This talk generated new registrations for the Prenomics and Genomics training courses and allowed questions from the general public to be addressed.
Year(s) Of Engagement Activity 2022
URL https://drive.google.com/file/d/1pO-DXIR3p8XncrvGxlf5KLBiRPQYJfxp/view?usp=sharing
 
Description Presentation at University of York's Open Day 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Postgraduate students
Results and Impact An open day was hosted at the University of York, at which Emma Rand, the Co-Project Lead, delivered an informative presentation on the goals of the Cloud-SPAN project. This allowed individuals to ask any questions regarding the project and helped to promote registrations for the Cloud-SPAN activities.
Year(s) Of Engagement Activity 2021
 
Description UK Conference for Bioinformatics and Computational Biology talk 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact The UK Conference of Bioinformatics and Computational Biology 2021 brings together biologists, bioinformaticians, computer scientists, software engineers and data scientists across the life sciences, to share innovations, applications and best practice in their fields.
We took part in a workshop session for UKRI Innovation Scholars - Data Science Training in Health and Bioscience, for all the projects awarded as part of this UKRI grant call to hear about the training they are developing in data for life scientists. This session was relevant to those working in life science data who wanted to learn more about the future of training, and was especially relevant to people who already run training in data science in the areas of health and bioscience.
This allowed networking with potential participants of Cloud-SPAN training and with those able to publicise and promote our project.
Year(s) Of Engagement Activity 2021
URL https://www.earlham.ac.uk/uk-conference-bioinformatics-and-computational-biology-21#Programme-5