Bilateral NSF/BIO-BBSRC: A Metagenomics Exchange - enriching analysis by synergistic harmonisation of MG-RAST and the EBI Metagenomics Portal

Lead Research Organisation: European Bioinformatics Institute
Department Name: Sequence Database Group

Abstract

Micro-organisms are found in virtually all environments. Typically, they form the base of the food chain (such as plankton in the sea) and play essential roles in their ecosystems. There is often a complex interplay between different micro-organisms, with some organisms requiring that others be present in order for them to exist. When there is an imbalance within a community, this can lead to severe effects, such as disease in the human gut, or the inability for plants to grow efficiently in soil. An understanding of the composition and interplay within the communities allows us to potentially manipulate them. Thus, there is intense research into micro-organism communities in many different fields, such as improving livestock yields, the recovery from bacterial infections using fecal transplants and the efficient production of biofuels. Many of these communities also contain important proteins that could be useful to the biotechnological and pharmaceutical industries, such as enzymes involved in the production of antibiotics.
Metagenomics is the study of these different micro-organism communities, which is achieved by isolating the DNA from the organisms within an environmental sample (e.g. water, soil, animal stool), sequencing the DNA, followed by the computational analysis to decode which organisms are present and the functions they might be performing. This computation is complicated: (1) there is a huge amount of data; (2) The sequence data is a jumbled mix of fragments from different organisms; (3) Decoding the DNA is hard - typically >90% of organisms within a sample are not well characterised.
This proposal brings together three major resources within the field of metagenomics data archiving and analysis. The European Nucleotide Archive (ENA) is a repository of DNA sequence data. Importantly, ENA also captures metagenomic contextual data, such as where and when the sample was taken, and how the DNA was extracted and sequenced. The EBI metagenomics portal (EMG, UK) and MG-RAST (MGR, US) are two metagenomics sequence analysis platforms. Uniquely, they represent the only free-to-use services whereby researchers can upload sequence data and have it analysed without restriction. Despite the widespread use of metagenomics, the community currently lacks standards to ensure that metagenomics sequence data and the derived functional and taxonomic information are deposited within a database of record. Consequently, navigating between metagenomics datasets is very difficult, even for experienced users. As EMG and MGR offer slightly different, yet complementary, analysis services, there is often the desire to have a metagenomics dataset analysed by both resources. However, the number of equivalent datasets between the two resources is unknown. Unless a user has prior knowledge about equivalent projects, they remain disconnected. Also, sequence data submitted to MGR may not necessarily be deposited in ENA. We propose to set up a computational framework, termed Metagenomics Exchange (ME), to enable metagenomics datasets and the results of their analysis to be linked. All sequences will become available to the research community via ENA, and analysis results will be automatically exchanged between EMG and MGR. The ME will be implemented to enable other metagenomics analysis providers to join, and so that it can be used by researchers wishing to perform large scale analyses. We will also investigate ways that our own pipelines can be enhanced through the use of the ME, sharing software and processing tasks, for example.
This will lead to computational savings, increasing the capacity for metagenomics analysis. We will also generate a knowledge transfer forum, enabling the exchange of ideas on a range of topics, from hardware solutions to algorithms. Finally, we will undertake a research program to investigate the optimal combination of pipeline analysis components, and whether a single, unified analysis pipeline could be engineered.

Technical Summary

Metagenomics is a widely used approach to investigate the composition of microbial communities. With the development of modern sequencing platforms, (sequence) data generation is rarely the bottleneck, but rather its analysis. MG-RAST (MGR) and EBI Metagenomics (EMG) are the two world-leading metagenomics analysis platforms. These analysis platforms employ distinct, yet complementary, approaches for the functional characterisation of metagenomic sequences. However, their pipelines closely align in the early stages of analysis, such as quality control. Unlike other datatypes, there is no mandate for researchers to submit metagenomics data to an analysis platform. Furthermore, resources such as MGR are not linked to an INSDC member, such as the European Nucleotide Archive (ENA). Currently, metagenomics sequence data, associated contextual metadata and derived functional and taxonomic assignments are disjointed within the field. Consequently, it is virtually impossible to navigate these cumbersome datasets. We propose to solve this problem by the development of a 'Metagenomics Exchange' (ME), which builds upon ENA technologies, to provide a registry of metagenomics datasets. MGR and EMG will use this registry to discover new datasets and publish their derived annotations, using tools and RESTful APIs to push/pull information from the registry. With the ME in place, we will populate it with existing datasets, developing the tools necessary to identify equivalent datasets. MGR and EMG will standardise on common analysis components and utilise the ME to enable crosstalk between pipelines, reducing computational overhead. The two teams will also exchange technology knowledge, such as data storage solutions and pipeline containerization. The websites will be harmonised to seamlessly present federated analysis results from both platforms, thereby enriching interpretation. We will investigate optimal pipeline solutions that may pave the way for a unified pipeline.

Planned Impact

The use of metagenomics is widespread, with its application in diverse fields, e.g. agriculture, food manufacture, the elucidation of both antibiotic products and antibiotic resistance mechanisms, bioenergy, crop yields and animal/human health. Consequently, metagenomics data continues to grow exponentially, with ever increasing demands on community analysis services. As yet, the field lacks systematic co-ordination and organisation of sequence data and derived functional and taxonomic information. We propose to solve this through the development of the Metagenomics Exchange (ME), which will primarily address the key area of data driven bioscience, but also have significant influences on many of the strategic priorities for the BBSRC and NSF.
The impact of both the EBI metagenomics (EMG) and MG-RAST (MGR) analysis platforms on academic research is already evident. Both provide robust, specialised analyses and access to significant amounts of compute (~55 million CPU hours/year). The ME will catalogue information about different metagenomic datasets and their analyses, enabling users from both academic and industrial sectors to rapidly discover them. Moreover, EMG and MGR will collect and present results from each other's platform, ensuring that a user is presented with all available analyses (saving user time/effort). To reduce duplication and to minimise differences, EMG and MGR will standardise on common parts of their pipelines. This will improve consistency and, as the project matures, allow crosstalk between the analysis pipelines. Crosstalk will also reduce computational overhead, allowing greater throughput for the community. The EMG and MGR websites collectively have 100,000s of individual visitors per year. Steps to harmonise the websites will improve user experience for both new and existing users.
Our objective of improving data discoverability via ME is to allow metagenomics results to reach a broader life science community, where individuals may be otherwise unaware of the data. It is important to also note that, in this project, we are also establishing a new collaboration, enabling MGR and EMG to become more aligned. Knowledge transfer between the groups will expand both UK and US skills in high throughput bioinformatics analysis.
The staff employed on this grant will receive hands-on training from members in the Finn, Cochrane and Meyer teams. All the institutes have excellent training schemes and career development courses and the staff will be working in world class laboratories of internationally renowned scientists. They will have opportunities to present their work within the groups, between the groups and at international conferences. Both technical developments and research findings will be presented at conferences and published in peer reviewed journals. Information about all the resources, especially the new ME, will be disseminated to the community via peer-review journals, conference presentations, a specialist workshop, and online training materials. We will also engage with the non-specialist and public domains via non-scientific literature, social media (blogs and tweets) and by attending meetings aimed at a range of audiences. These activities will maximize dissemination into the academic, industrial and 3rd-party communities.
MGR and EMG will leverage their links to the industrial sectors to ensure that this sector's needs are met. Indeed, the biotechnology industry may benefit the most from the implementation of ME, as they are frequently engaged in identifying catalytic activities across multiple datasets. The ME will enhance the translation of metagenomics research to industrial applications. In the longer term, the knowledge gained from understanding complex communities will have significant impacts for the UK, US and World economies from more efficient industrial enzymes, through improved soil conditions and crop yields, to healthcare solutions by comparing diseased and healthy states.

Publications

Harrison PW (2021) The European Nucleotide Archive in 2020. in Nucleic acids research

Richardson L (2023) MGnify: the microbiome sequence data analysis resource in 2023. in Nucleic acids research

Mitchell AL (2020) MGnify: the microbiome analysis resource in 2020. in Nucleic acids research

Toribio AL (2017) European Nucleotide Archive in 2016. in Nucleic acids research

Amid C (2020) The European Nucleotide Archive in 2019. in Nucleic acids research

Harrison PW (2019) The European Nucleotide Archive in 2018. in Nucleic acids research

 
Description We have designed, implemented, and released the infrastructure for housing the Metagenomics Exchange (ME) data. At the core of the ME is a simple registry that captures whether a particular resource has either a sequence dataset, or a set of analysis results for a dataset. Both resource providers and end users interact with the ME Registry through the API. MGnify has currently analysed 75% of the MG-RAST brokered datasets, paving the way for exchanging analysis results between the two sites. As MGnify has introduced assembly, the ME Registry has been extended to handle both assembled sequence data accessions and the original sequencing project. No sequence data is held within the registry, but rather the accessioned mappings between two "equivalent" sequence sets found in each of the two broker resources (ENA and MG-RAST). Identified sequence set accessions can then be used to query the ME registry to provide access to the locations of the respective sequence sets and associated analysis results. The ME Registry consists of two types of API: an administration interface and a public read-only interface. The administration interface allows resource providers to register and manage their datasets for exchange, while the public read-only interface allows users to find and query the identified runs for mappings to metagenomics datasets. Authorisation for the registry is performed using access tokens. Each resource provider group (MG-RAST, MGnify and ENA) has its own token, which is required for all administration tasks (submit, update, delete), as well as for read access to pre-publication data.
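The registry model described above can be illustrated with a small in-memory sketch. This is not the ME Registry's actual API: the class names, field names, tokens, accessions and example.org locations are all hypothetical, and the rule that any valid provider token can read pre-publication entries is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    accession: str       # sequence set accession (hypothetical value below)
    provider: str        # resource that holds the data or analysis results
    kind: str            # "sequence" or "analysis"
    location: str        # where the data or results can be retrieved
    public: bool = True  # False for pre-publication data


@dataclass
class Registry:
    tokens: dict                              # access token -> provider name
    entries: list = field(default_factory=list)

    def _provider_for(self, token):
        # Administration tasks require a valid provider token.
        provider = self.tokens.get(token)
        if provider is None:
            raise PermissionError("a valid access token is required")
        return provider

    def submit(self, token, accession, kind, location, public=True):
        provider = self._provider_for(token)
        entry = Entry(accession, provider, kind, location, public)
        self.entries.append(entry)
        return entry

    def delete(self, token, accession):
        # Assumption: a provider may only remove its own entries.
        provider = self._provider_for(token)
        self.entries = [
            e for e in self.entries
            if not (e.accession == accession and e.provider == provider)
        ]

    def lookup(self, accession, token=None):
        # Public read-only access; a valid token also reveals
        # pre-publication entries (assumption, see lead-in).
        privileged = token in self.tokens
        return [
            e for e in self.entries
            if e.accession == accession and (e.public or privileged)
        ]


registry = Registry(tokens={"tok-mgnify": "MGnify", "tok-mgrast": "MG-RAST"})
registry.submit("tok-mgnify", "RUN0001", "analysis",
                "https://example.org/mgnify/RUN0001")
registry.submit("tok-mgrast", "RUN0001", "analysis",
                "https://example.org/mgrast/RUN0001", public=False)

public_hits = registry.lookup("RUN0001")
all_hits = registry.lookup("RUN0001", token="tok-mgrast")
```

Here the anonymous lookup sees only the public MGnify entry, while the lookup with a provider token also sees the pre-publication MG-RAST entry.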

Determining which sequence datasets are equivalent in MG-RAST and ENA has proved to be challenging. One of the main issues is that MG-RAST does not store the original raw FASTQ files, but rather stores the quality controlled FASTA files. Thus, retrospective population of the ME registry was less straightforward than originally anticipated. The sorts of tools that make this activity possible when metadata matching alone is insufficient are only now becoming available. We have extended the ME registry to include the method(s) used to infer equivalence: hash_of_sequence, kmer_profile, taxonomy_signature, functional_signature, gps_coordinates, biome, other_metadata. Confidence in the results will be provided and is defined as full if the sequence hashes match, high if the biome and GPS match, medium for a good combination of other fields and low for uncertain matches. While many of the older datasets will not be mapped, part of the work focuses on enabling the brokering of MG-RAST sequence datasets into ENA. This will enable the direct capture of equivalence between MG-RAST and ENA.
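The confidence levels above can be read as a simple decision rule over the set of methods that agreed. The sketch below is illustrative only: the function name is hypothetical, and because the text does not define "a good combination of other fields", the code assumes that any two or more agreeing methods count as medium.

```python
# Method names as listed for the ME registry.
METHODS = (
    "hash_of_sequence",
    "kmer_profile",
    "taxonomy_signature",
    "functional_signature",
    "gps_coordinates",
    "biome",
    "other_metadata",
)


def equivalence_confidence(matched_methods):
    """Map the methods that matched to a confidence level."""
    matched = set(matched_methods)
    if not matched:
        raise ValueError("at least one matching method is required")
    unknown = matched - set(METHODS)
    if unknown:
        raise ValueError(f"unknown method(s): {sorted(unknown)}")

    if "hash_of_sequence" in matched:
        return "full"    # identical sequence hashes
    if {"biome", "gps_coordinates"} <= matched:
        return "high"    # biome and GPS both match
    if len(matched) >= 2:
        return "medium"  # assumption: two or more other fields agree
    return "low"         # a single weak signal


levels = [
    equivalence_confidence(["hash_of_sequence"]),
    equivalence_confidence(["biome", "gps_coordinates"]),
    equivalence_confidence(["kmer_profile", "taxonomy_signature"]),
    equivalence_confidence(["biome"]),
]
```

Checking the strongest evidence first keeps the rule easy to audit, since each dataset pair receives the highest level it qualifies for.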

Both MGnify and MG-RAST have adopted the Common Workflow Language (CWL) to describe their analysis pipelines in a standard fashion. To achieve this, the MG-RAST execution framework (AWE) has been extended to be able to execute CWL pipelines. The MGnify analysis pipelines have been described in CWL since version 3, and we are now at version 5. We have also described the MGnify assembly pipeline in CWL. The latest version of the MGnify analysis pipeline, version 5.0, has been re-worked to produce three distinct versions (amplicon, raw read, and assembly analysis). These have been rapidly built using common components (subworkflows) where appropriate. We have also successfully evaluated different execution engines for running the CWL on different compute infrastructures. Both MG-RAST and MGnify teams are promoting the use of CWL as part of the Genomic Standards Consortium, as well as through contributions in commentaries. Describing pipelines in a standard format allows complete provenance of the pipeline (e.g. allowing reproducible science), simpler comparison between the two, and rapid rebuilding and combining of components.

To enrich search and retrieval of data from MGnify, we have developed and released a RESTful API, providing programmatic access to all of the data contained within the resource. The base address of the API gives access to several collections of resources, such as studies, samples, runs, experiment-types, biomes and annotations. Combined with appropriate relationships to other resources, these can be filtered and sorted by selected attributes, allowing complex queries to be constructed (for example: 'retrieve all oceanographic samples from metagenomic studies taken at temperatures less than 10C'). The provision of such complex queries allows metadata to be combined with annotation for powerful data analysis and visualisation. We have utilised an interactive documentation framework (Swagger UI) to visualise and simplify interaction with the API's resources via an HTML interface, allowing less experienced users to interactively build up API queries. Detailed explanations of the purpose of all resources, along with many examples, are also provided to guide end-users. This, in combination with the MG-RAST API, provides the underlying mechanisms for data exchange between the resources and for disseminating the results. Consequently, the MGnify API has witnessed a substantial increase in usage, receiving millions of requests per month.
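As an illustration of how such a filtered query might be composed, the sketch below builds a request URL for the oceanographic example using only the Python standard library. The helper name is hypothetical, and the biome lineage and filter parameter names (experiment_type, metadata_key, metadata_value_lte) are assumptions that should be checked against the interactive API documentation. Note that an "lte" style filter would express "at most 10" rather than strictly "less than 10". No request is sent.

```python
from urllib.parse import quote, urlencode

# Base address of the MGnify API (version path assumed to be v1).
BASE_URL = "https://www.ebi.ac.uk/metagenomics/api/v1"


def samples_query(biome_lineage, **filters):
    """Build a URL listing samples under a biome, narrowed by filters."""
    # Keep ':' unescaped so the lineage stays readable in the path.
    path = f"{BASE_URL}/biomes/{quote(biome_lineage, safe=':')}/samples"
    # Sort the filters so the same query always yields the same URL.
    query = urlencode(sorted(filters.items()))
    return f"{path}?{query}" if query else path


url = samples_query(
    "root:Environmental:Aquatic:Marine",  # assumed lineage for marine samples
    experiment_type="metagenomic",
    metadata_key="temperature",
    metadata_value_lte=10,
)
```

The resulting URL could then be fetched with any HTTP client and the paginated JSON response processed as needed.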

First developed in 2010, the former MGnify website was not designed with modern API approaches in mind, and adopted the now antiquated design of the server directly contacting the backend database. Therefore, exposing new data types and pulling in data via the MG-RAST API was going to be extremely time consuming. To overcome this, we have completely rewritten the MGnify site in order to consume the new MGnify API (thereby reducing duplication of effort). Furthermore, the website was rebuilt in a modern framework; this included the development of a portable JavaScript library to consume the MGnify API (implemented in Backbone JS). This may be released using a public package repository in the future and will be shared with MG-RAST to enable them to consume and display MGnify outputs with minimal effort.

During the course of this project, MGnify has added metagenomic assembly as another component of the pipeline repertoire. We have each shared our experiences with metagenomics assembly, especially in terms of the performance of different algorithms and the quality of assemblies. The MGnify team has showcased their neural network for assembly parameter estimation. We have also exchanged ideas on API design and the benefits (and drawbacks) of using standards/best practices. As part of this work, we have a technical article on API provision published in PLOS Computational Biology. We have also tried to look for consistencies between our API endpoints and have a clear understanding of each other's APIs (and infrastructures). For example, MG-RAST has a Cassandra backend, whereas MGnify is backed by MongoDB. While we both adopt NoSQL solutions, Cassandra offers greater search functionality compared to MGnify's current MongoDB system. This limited search is being overcome by releasing software solutions that enable equivalent searches by combining API queries (e.g. the Metagenomics Tool Kit). Furthermore, we have had specific meetings describing the containerisation of our workflows. We have also exchanged ideas surrounding the use of Simka for k-mer profiling of datasets. Because this algorithm removes low-abundance k-mers (which can arise from small variations introduced by quality control), it has been possible to match imperfect datasets. However, we have not been able to scale the update procedure of this and are currently investigating solutions and alternatives.
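The idea of matching imperfect datasets by discarding low-abundance k-mers can be shown with a toy sketch. This is not Simka's implementation: the function names, the k-mer size, the abundance threshold and the example reads are all illustrative assumptions, and Bray-Curtis dissimilarity is used here as one possible distance.

```python
from collections import Counter


def kmer_profile(reads, k=4, min_count=2):
    """Count k-mers across reads, dropping those seen fewer than min_count times."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


def bray_curtis(profile_a, profile_b):
    """Bray-Curtis dissimilarity between two k-mer count profiles (0 = identical)."""
    shared = sum(
        min(profile_a[kmer], profile_b[kmer])
        for kmer in profile_a.keys() & profile_b.keys()
    )
    total = sum(profile_a.values()) + sum(profile_b.values())
    if total == 0:
        return 0.0
    return 1.0 - 2.0 * shared / total


# Two toy datasets: the second differs by a single base in one read,
# standing in for a small difference introduced by quality control.
original = ["ACGTACGTACGT", "ACGTACGTACGT"]
processed = ["ACGTACGTACGT", "ACGTACGTACGA"]

profile_original = kmer_profile(original)
profile_processed = kmer_profile(processed)
distance = bray_curtis(profile_original, profile_processed)
```

The k-mer produced only by the altered base occurs once and is filtered out, so the two profiles remain very close despite the datasets not being byte-identical.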

We have reviewed the respective steps in our pipelines to identify commonalities and where a common solution may prove beneficial. The initial comparisons of the pipelines have indicated that the highest degree of overlap resides in the initial quality control and trimming sections. We also strongly believe that our independent approaches to functional annotation are complementary. MG-RAST provides the best match to a sequence using sequence similarity searches against a large sequence database, while MGnify provides matches to different protein family databases. As many sequences lack functional annotation, the domain annotations can be more informative. On the other hand, the presence of certain domains does not always provide a description of the overarching function of a sequence, where a full-length match to an annotated sequence would. To overcome these limitations, MGnify has adopted DIAMOND searches with UniRef90 for the annotations of their assemblies. Similarly, MGnify also includes KEGG and antiSMASH annotations, which provide access to higher order annotations. Since the commencement of this project, MGnify has moved to offering assembly as a service, a capacity that MG-RAST is yet to afford. We are sharing our workflow descriptions for this process, putting CWL into practice to achieve these outcomes.
Exploitation Route Although the Metagenomics Exchange (ME) was originally developed with MG-RAST and MGnify, the model is completely agnostic about analysis source. The only restriction is that the underlying sequence data that the analysis is based upon is found within the ENA (submitted directly or to one of the INSDC partners). This means that other metagenomics analysis resources, such as IMG/M and iMicrobe, could also use the ME to expose their analysis results, making them discoverable for other research scientists.

With the current systems, we will make it simple for research scientists to know when a common dataset has been analysed in both resources. As both resources have different analysis strategies, they may highlight different features in the dataset, accelerating the rate of novel discovery. Moreover, when the results are consistent, it provides independent validation of the results.

The new MGnify website now provides a more consistent view of the data plus the associated API, providing access to the terabytes of processed data. This API is accompanied by software libraries that both illustrate the use of the API using standard libraries and can be used to access the data. At the time of writing, these libraries have been downloaded over 25,000 times.

The CWL descriptions of our pipelines allow for complete provenance of the analysis, increasing transparency of how the results were derived and how two pipelines may differ, allowing scientists to account for the differences that arise from informatics variation. Furthermore, these CWL descriptions can be taken and extended or modified (e.g. inclusion of new tools or reference databases). Our use of CWL is also driving the execution frameworks (being developed by third parties), e.g. Toil and CWLEXEC. As CWL is not confined to biology, it potentially has a very broad impact.
Sectors Agriculture, Food and Drink,Digital/Communication/Information Technologies (including Software),Environment,Healthcare,Manufacturing, including Industrial Biotechnology

 
Description MGnify pipelines have been encapsulated using the Common Workflow Language (CWL). Examples are now available in the WorkflowHub (https://workflowhub.eu/). These pipelines have been used as exemplars to a wide range of communities (industrial and academic) on how to develop complex computational analysis pipelines, and how they can be used to capture the provenance of the analysis (software tools, parameters, reference databases, inputs and outputs). Establishing this foundation has opened up new opportunities to collaborate with the scientific community, with MGnify pipelines forming the basis of new metagenomic analysis pipelines that have been developed and deployed on the European Open Science Cloud (EOSC) compute infrastructure. Furthermore, this work has facilitated the use of research object crates to federate the analysis of the metagenomic datasets. We have also contributed to open access software for executing these workflows, namely Toil, which is also used across sectors, including physics and geography. The taxonomic identification data generated in MGnify is now flowing into the Global Biodiversity Information Facility (GBIF), which has a completely different audience to either MGnify or MG-RAST. This data is presenting a new view on environmental biodiversity and provides the ability to connect between biodiversity and museum collections. The data in GBIF is also used to inform governmental policies across the world, with MGnify representing one of the largest contributors of taxonomic identifications. Subsequent to this work, we have had further discussions with MG-RAST on how to integrate some of their unique features within MGnify to enable us to meet some of the demands of the UK and wider research community. For example, we have been approached about having the SEED subsystem annotations in MGnify.
This work facilitated an understanding of the MG-RAST pipeline, which enabled us to rapidly evaluate the cost-benefit of introducing additional analysis components to the MGnify systems.
First Year Of Impact 2020
Sector Agriculture, Food and Drink,Digital/Communication/Information Technologies (including Software),Environment,Government, Democracy and Justice,Culture, Heritage, Museums and Collections,Pharmaceuticals and Medical Biotechnology
Impact Types Economic,Policy & public services

 
Description Workflow systems turn raw data into scientific knowledge
Geographic Reach Multiple continents/international 
Policy Influence Type Influenced training of practitioners or researchers
Impact These workflow tools can make your computational methods portable, maintainable, reproducible and shareable.
URL https://www.nature.com/articles/d41586-019-02619-z
 
Description (EOSC-Life) - Providing an open collaborative space for digital biology in Europe
Amount € 23,745,996 (EUR)
Funding ID 824087 
Organisation European Commission 
Sector Public
Country European Union (EU)
Start 03/2019 
End 02/2023
 
Title Application of CWL for describing analysis workflow 
Description Different services provided by the MGnify resource, namely assembly and analysis, have been encapsulated in the Common Workflow Language (CWL), which allows complete provenance of the software and/or reference databases used, associated parameters and, more recently, associated containers providing access to these tools.
Type Of Material Improvements to research infrastructure 
Year Produced 2018 
Provided To Others? Yes  
Impact Our use of CWL has driven both the CWL specification and the development of the execution engines required to run CWL workflows. CWL is a community project, involving cross-disciplinary teams. One execution framework, Toil, is an open source software project and has been further developed by the community in response to bugs that we reported. Similarly, IBM developers have been improving CWLEXEC in response to our work. Both the MGnify and MG-RAST pipelines are now described in CWL, allowing both teams, as well as others, to more readily compare the pipelines and understand the similarities and differences. These CWL descriptions can also be reused by the community, either to build novel workflows, or to adapt the existing workflows by introducing new tools and reference databases. Finally, the adoption of CWL has allowed us to elastically scale our compute, by using both academic and commercial clouds to assess cost/benefits in this changing landscape.
URL https://www.commonwl.org
 
Title ENA 
Description The European Nucleotide Archive (ENA) captures and presents information relating to experimental workflows that are based around nucleotide sequencing. A typical workflow includes the isolation and preparation of material for sequencing, a run of a sequencing machine in which sequencing data are produced and a subsequent bioinformatic analysis pipeline. ENA records this information in a data model that covers input information (sample, experimental setup, machine configuration), output machine data (sequence traces, reads and quality scores) and interpreted information (assembly, mapping, functional annotation). Data arrive at ENA from a variety of sources. These include submissions of raw data, assembled sequences and annotation from small-scale sequencing efforts, data provision from the major European sequencing centres and routine and comprehensive exchange with our partners in the International Nucleotide Sequence Database Collaboration (INSDC). 
Type Of Material Database/Collection of data 
Provided To Others? Yes  
Impact ENA is the European arm of INSDC. However, ENA has specifically been extended to allow the deposition of metagenome assemblies and binned assemblies. We have also worked on ensuring that metadata associated with sequence data are appropriately captured through the development of checklists.
URL https://www.ebi.ac.uk/ena
 
Title MGnify (formerly called EBI metagenomics)
Description The MGnify resource is an automated pipeline for the analysis and archiving of metagenomic data that aims to provide insights into the phylogenetic diversity as well as the functional and metabolic potential of a sample. It enables users to freely browse all the public data and associated analysis results that are contained within the resource. More recently (in 2018) we have started to provide metagenomics assembly as a service to the community, which is often not performed due to the computational overheads.
Type Of Material Database/Collection of data 
Year Produced 2012 
Provided To Others? Yes  
Impact MGnify provides access to some of the largest metagenomics projects and is a large collection of analysed metagenomic datasets. Uniquely, it enables consistent analysis between projects, enabling scientists to compare results to other datasets in the resource or to their own.
URL https://www.ebi.ac.uk/metagenomics
 
Title MGnify (previously EBI Metagenomics Portal) 
Description MGnify, previously EBI Metagenomics (https://www.ebi.ac.uk/metagenomics/), is a database of richly described shotgun metagenomics data sets from across sample environments. Drawing on user-submitted data, functional and taxonomic analysis pipelines provide systematic processing and analysis of data. Both input data and analysis outputs are freely available in a variety of presentations and downloadable data formats. The database combines permanent archiving functions (through connectivity with public sequence databases) and state-of-the-art analysis methods.
Type Of Material Database/Collection of data 
Year Produced 2012 
Provided To Others? Yes  
Impact The database is core to the MGnify programme, such that general programme impacts (see elsewhere in our outcome reporting for the programme) are all relevant to the database. 
URL https://www.ebi.ac.uk/metagenomics/
 
Title Metagenome Exchange Registry 
Description Database for the capture and presentation of data linking metagenomics analyses, such as from MG-RAST and MGnify, to raw data sets in INSDC databases; includes Application Programming Interfaces for data input and access.
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
Impact The Metagenome Exchange Registry has been promoted towards users external to the project, such as JGI and the MAR databases. 
URL https://www.ebi.ac.uk/ena/registry/metagenome/api/
 
Description MG-RAST 
Organisation Argonne National Laboratory
Country United States 
Sector Public 
PI Contribution Discussing ideas and experiences on large scale bioinformatics analysis of metagenomics. Knowledge of data submission.
Collaborator Contribution Data submission to ENA of metagenomics datasets. Knowledge of metagenomics analysis.
Impact Plan for pipeline interoperability.
Start Year 2017
 
Title Metagenomics toolkit 
Description Metagenomics toolkit enables scientists to download all of the sample metadata for a given study or sequence to a single csv file. 
Type Of Technology Software 
Year Produced 2018 
Open Source License? Yes  
Impact Improved access to sample metadata enabling easier integration to workflows. 
URL https://pypi.org/project/mg-toolkit/
 
Description 1st Microbiome PT Summit keynote talk "Can microbiome analysis be FAIR?" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Microbiome research has grown substantially over the past decade and produced massive amounts of data across the range of biomes sampled, and now faces challenges in terms of data findability, accessibility, interoperability and reuse. ELIXIR, the European Infrastructure for Biological Data, is addressing these topics via domain-specific communities, namely via its Microbiome Community. BioData.pt, as the Portuguese Node of ELIXIR, is assembling its National Community to engage Portuguese researchers on this topic in this European effort. Therefore, the Portuguese Microbiome Community, led by Isabel Gordo, from Instituto Gulbenkian de Ciência, organised its first National Summit to raise awareness and gather scientists addressing microbiome research in Portugal. Dr Rob Finn (EMBL-EBI), Head of the ELIXIR Microbiome Community and EMBL-EBI's Microbiome Informatics team, chaired the opening session and gave the keynote talk.
Year(s) Of Engagement Activity 2021
URL https://www.biodata.pt/node/289
 
Description 2020 Annual Research Conference of the Pontificia Universidad Católica del Perú talk "Broadening our genomic knowledge of the human microbiome" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact PI Dr Rob Finn described the role of MGnify, including the resource's gut catalogue, in microbiome research. He highlighted how Latin American samples were underrepresented. Finally, he provided advice on the different career paths available for researchers in bioinformatics.
Year(s) Of Engagement Activity 2020
 
Description 2020 POGO International Virtual Conference on the use of Environmental DNA (eDNA) in Marine Environments: Opportunities and Challenges talk "MGnify: An open and scalable platform for the analysis, discovery and dissemination of molecular based biodiversity data" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact PI Dr Rob Finn gave a talk during the 2020 POGO International Virtual Conference on the use of Environmental DNA (eDNA) in Marine Environments: Opportunities and Challenges. His talk, which focused on MGnify, was given on day 2 of the conference in the session on Data and Information. The session description is as follows: Through systems such as the International Nucleotide Sequence Database Collaboration (INSDC) and global standards like FASTA/Q format, the eDNA/omics community have benefitted from world-class data and information resources. However, our handling of what is, from our perspective, "metadata" and participation/interoperability with data systems from other disciplines is still in need of advancement. In the marine realm, we now have new opportunities to augment our digital capacities while aligning them with global digital strategies such as those within the UN Decade of Ocean Science for Sustainable Development. This session will explore some examples of how this is already taking place, and will welcome discussion on how we can collectively mainstream sequence data (as well as the information and knowledge derived from it) in the emerging digital ocean ecosystem.
Year(s) Of Engagement Activity 2020
URL https://pogo-ocean.org/capacity-development/activity-related-workshop/environmental-dna-edna-marine-...
 
Description 21st GSC Meeting talk "EBI's use of CWL workflows" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Policymakers/politicians
Results and Impact Talk given at the 21st Genomic Standards Consortium Meeting held at the University of Vienna, Austria.
Year(s) Of Engagement Activity 2019
URL https://gensc.org/meetings/gsc21/
 
Description 5th Microbiome Movement - Drug Development Europe conference talk "Magnifying the Human Gut Microbiome" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact During the 5th Microbiome Movement - Drug Development Europe conference, PI Dr Rob Finn presented a talk on the unified human gut genome catalogue and phages, with a view to understanding the potential translational impact of human microbiome research.
Year(s) Of Engagement Activity 2021
URL https://microbiome-europe.com/?utm_source=hw-corporate&utm_medium=backlink&utm_campaign=brand-page
 
Description BiATA 2019 invited talk "Insights into the human gut microbiota from a (meta-)genomic perspective" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact PI Dr Robert Finn was a featured speaker at the 2019 BiATA conference held at the Graduate School of Management St Petersburg University, Russia. The talk covered recent work carried out by the team that resulted in new insights into the human gut microbiota.
Year(s) Of Engagement Activity 2019
URL http://biata2019.spbu.ru
 
Description BiATA 2020 workshop on "Analysing metagenomic assemblies using MGnify" 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact In this two-day remote tutorial provided by PI Dr Rob Finn and his team during the BiATA2020 conference, participants explored common approaches to analysing and annotating contigs produced from a metagenomics assembly. The course was a mixture of introductory lectures, followed by hands-on practicals. Due to time constraints, participants either investigated pre-calculated examples or used a web browser to explore outputs via the MGnify website (www.ebi.ac.uk/metagenomics). By the end of the course, participants understood how to process contigs and characterise them functionally and taxonomically, and were able to generate metagenome-assembled genomes from their assemblies.
Year(s) Of Engagement Activity 2020
URL http://biata2020.spbu.ru/workshop/
 
Description CABANA training workshop titled "Introduction to Metagenomics" 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Training modules in metagenomics were delivered during the 5-day CABANA workshop held at the Faculty of Natural Sciences - University of Buenos Aires (FCEN-UBA), Argentina. In this course, participants learnt the basics of metagenomics, covering experimental design and workflows, moving through to microbiome analysis via metabarcoding and shotgun metagenomics. The course theme focused on metagenomics oriented to biodiversity. Talks and hands-on practical sessions were delivered to cover all aspects of the course work.
Year(s) Of Engagement Activity 2019
URL https://www.ebi.ac.uk/training/events/2019/cabana-workshop-introduction-metagenomics
 
Description CSHL Biology of Genomes 2020 talk titled "Broadening our genomic knowledge of the human microbiome" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The Biology of Genomes 2020 meeting organised by the Cold Spring Harbor Laboratory addressed DNA sequence variation and its role in molecular evolution, population genetics and complex diseases, comparative genomics, large-scale studies of gene and protein expression, and genomic approaches to ecological systems. Both technologies and applications were emphasized. There was a special session on the ethical, legal and social implications (ELSI) of genome research. PI Dr Rob Finn chaired the session on Complex Traits and Microbiome and presented a talk.
Year(s) Of Engagement Activity 2020
URL https://meetings.cshl.edu/meetings.aspx?meet=GENOME&year=20
 
Description Connecting Science Public Engagement Prizes 2020 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact The Wellcome Genome Campus (EMBL-EBI and WSI) Public Engagement Advocacy prize recognises members of staff or students who have enabled positive change in public engagement through leadership, guidance, practical measures or emotional support. EMBL-EBI PI Dr Rob Finn won the 2020 award for Public Engagement Advocacy.
Year(s) Of Engagement Activity 2020
URL https://publicengagement.wellcomegenomecampus.org/connecting-science-public-engagement-prizes-2020
 
Description EBI Industry talk titled "A new genomic blueprint of the human gut microbiota" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact This talk was presented during the EBI industry programme quarterly meeting held at EMBL-EBI, UK and focused on future developments of MGnify and making human genomes accessible.
Year(s) Of Engagement Activity 2019
 
Description ELIXIR Europe tweet "ELIXIR CDR corona virus genomes MGnify" 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Tweet from the official ELIXIR Europe account highlighting the work carried out by PI Dr Rob Finn and his microbiome informatics team, which utilised the MGnify resource workflows to identify coronavirus genomes in ELIXIR Core Data Resources, such as the ENA (DOI 10.1093/bib/bbaa232).
Year(s) Of Engagement Activity 2020
URL https://twitter.com/ELIXIREurope/status/1323597075007840256
 
Description ELIXIR News "Identification of coronaviruses genomes in public datasets" 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The ongoing SARS-CoV-2 pandemic highlighted the need to understand all aspects of coronavirus biology, including their prevalence and diversity in animal hosts and the environment. Given the pressing need for greater knowledge around this topic, researchers within the Microbiome Informatics Team (PI Dr Rob Finn) at EMBL's European Bioinformatics Institute (EMBL-EBI) repurposed existing MGnify infrastructure to generate a pipeline that detects and characterises coronaviruses from metavirome and metatranscriptomic datasets. This pipeline identified a complete SARS-CoV-2 genome from a human lung sample collected in Wuhan, China, at the start of the pandemic - demonstrating proof of concept (DOI 10.1093/bib/bbaa232).
Year(s) Of Engagement Activity 2020
URL https://elixir-europe.org/news/identification-coronaviruses-genomes-public-datasets
 
Description EMBL Newsletter "Unparalleled inventory of the human gut ecosystem" 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact EMBL newsletter on the Finn team's Nature Biotech publication (10.1038/s41587-020-0603-3). In this paper, Dr Finn and his collaborators describe their work of compiling into a public database over 200,000 genomes from more than 4,600 species of gut bacteria.
Year(s) Of Engagement Activity 2020
URL https://www.embl.org/news/science/inventory-of-the-human-gut-ecosystem/
 
Description EMBL Science Education (ELLS Heidelberg) tweet "Meet my microbiome 2020" 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Schools
Results and Impact Tweet from the official Twitter account of EMBL Science Education (ELLS Heidelberg) inviting school teachers from Europe and beyond to learn about current research on the human microbiome and how to transfer this knowledge to their classrooms. Each module takes one week and is designed to fit the busy schedule of teachers. #MeetingMyMicrobiome2020
Year(s) Of Engagement Activity 2020
URL https://twitter.com/ELLS_Heidelberg/status/1314259917612736515
 
Description EMBL-EBI intranet news "Celebrate our public engagement champions!" 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The Wellcome Genome Campus (EMBL-EBI and WSI) annual Connecting Science Public Engagement Prizes celebrate and recognise outstanding efforts in engaging the public with the science, technology, research and innovation of the Wellcome Genome Campus. The 2020 prize winners were announced on 15 October in a virtual ceremony and Dr Rob Finn won the 2020 award for Public Engagement Advocacy.
Year(s) Of Engagement Activity 2020
URL https://tsc.ebi.ac.uk/news/celebrate-our-public-engagement-champions
 
Description EMBL-EBI online tutorial "Metagenomics bioinformatics - A practical introduction" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact This course covers the use of publicly available resources to manage, share, analyse and interpret metagenomics data, including marker gene, whole-genome shotgun (WGS) and assembly-based approaches. It makes use of recorded lectures and materials from the "Metagenomics Bioinformatics" training course that took place 17 - 20 July 2018 at EMBL-EBI. The recorded lecture material is aimed at life scientists working in the field of metagenomics who are in the early stages of their data analysis. These recordings are suitable for beginners with an undergraduate knowledge of metagenomics. The exercises included in this course are intended for an audience with experience of using bioinformatics in their research. A working knowledge of the Unix command line and the R statistical package is required.
Year(s) Of Engagement Activity 2020
URL https://www.ebi.ac.uk/training/online/courses/metagenomics-bioinformatics/
 
Description EMBL-EBI press release "Unparalleled inventory of the human gut ecosystem" 
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact EMBL-EBI press release on the Finn team's Nature Biotech publication (10.1038/s41587-020-0603-3). In this paper, Dr Finn and his collaborators describe their work of compiling into a public database over 200,000 genomes from more than 4,600 species of gut bacteria.
Year(s) Of Engagement Activity 2020
URL https://www.ebi.ac.uk/about/news/press-releases/inventory-human-gut-ecosystem
 
Description EMBL-EBI training course "Metagenomics bioinformatics (virtual)" 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact This course covered the metagenomics data analysis workflow from the point of newly generated sequence data. Participants explored the use of publicly available resources and tools to manage, share, analyse and interpret metagenomics data. The content included issues of data quality control and how to submit to public repositories. While sessions detailed marker-gene and whole-genome shotgun (WGS) approaches, the primary focus was on assembly-based approaches. Discussions also explored considerations when assembling genome data, the analysis that can be carried out by MGnify on such datasets, and what downstream analysis options and tools are available.
Year(s) Of Engagement Activity 2020
URL https://www.ebi.ac.uk/training/events/metagenomics-bioinformatics-virtual/
 
Description EMBL-EBI/WSI Seminar Series talk "The human microbiome beyond the gut" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Professional Practitioners
Results and Impact This seminar was presented as part of the EMBL-EBI and Wellcome Sanger Institute joint monthly seminar series. PI Dr Rob Finn's talk focused on recent efforts to recover MAGs from the human skin microbiome, which not only harbours a very distinct microbial composition compared to the gut, but also carries additional challenges such as low DNA yield. Approaches to overcome these challenges were presented, along with some of the insights the team has obtained into skin microbial diversity.
Year(s) Of Engagement Activity 2020
URL https://www.ebi.ac.uk/about/events/seminars/2020/ebisanger-seminar-series-rob-finn-and-phil-jones-zo...
 
Description EOSC-Life Seminar Series talk "Metagenomic data analysis workflows in CWL from scratch to multi-environment production" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact MGnify team member Martin Beracochea gave a talk during the EOSC-Life Seminar Series. He described how the MGnify pipelines already adhere to the FAIR principles, and how these could be deployed in cloud environments.
Year(s) Of Engagement Activity 2021
URL https://www.eosc-life.eu/d3/
 
Description EOSC-Life hackathon titled "Tool profiling in Toil, and testing of cwl-toil-runner" 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Participated in a hackathon for technical experts to improve the CWL implementation of tools, organised as part of EOSC-Life WP1 and held in Germany. The hackathon brought together individuals interested in common data types (e.g. genomics) but who may originate from different communities (e.g. plant genomics and rare diseases).
Year(s) Of Engagement Activity 2019
URL https://www.eosc-life.eu/news/hackathon/
 
Description European Learning Laboratory for the Life Sciences, ELLS blog "Introducing your microbiome" 
Form Of Engagement Activity Engagement focused website, blog or social media channel
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Schools
Results and Impact The European Learning Laboratory for the Life Sciences (ELLS), EMBL's education facility, invited secondary school science teachers to participate in a virtual training course in the autumn of 2020 entitled 'Introducing your microbiome'. The course was divided into four modules, providing an overview of current human microbiome research, introducing bioinformatics as a tool in microbiome research, and exploring microbiome research in health and disease. The final module consisted of group work in small teams, in which participants developed their own educational materials. The modules were taught by EMBL scientists PI Drs Rob Finn and Michael Zimmerman. The course was organised in collaboration with the Public Engagement officer at EMBL's European Bioinformatics Institute (EMBL-EBI) and was held entirely online.
Year(s) Of Engagement Activity 2020
URL http://emblog.embl.de/ells/virtual-llab-microbiome-2020/
 
Description Laboratory News feature titled "The secret microbiome" 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact Research work utilising MGnify to unlock the complexity and diversity of the human gut microbiome was featured in an article printed by Laboratory News.
Year(s) Of Engagement Activity 2019
URL http://www.labnews.co.uk/article/2024791/the_secret_microbiome
 
Description NATURE Milestone 25 titled "Metagenome-assembled genomes provide unprecedented characterization of human-associated microbiota" 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact A publication from the group titled "A new genomic blueprint of the human gut microbiota" [Nature https://doi.org/10.1038/s41586-019-0965-1 (2019)] was featured by Nature as part of a Milestone in Human Microbiota Research. https://media.nature.com/original/magazine-assets/d42859-019-00061-9/d42859-019-00061-9.pdf
Year(s) Of Engagement Activity 2019
URL https://www.nature.com/articles/d42859-019-00061-9
 
Description National Microbiome Data Collaborative Workshop: linking MIxS standards, Environment ontology, and GAZ; Burgin 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact A workshop from the US National Microbiome Data Collaborative initiative. We represented the standards and tools developed under MGP-III that are of value to this initiative. Discussions took place around these and other tools. Alignment with this project will secure global data accessibility and reach for data already routed towards ENA and MGnify.
Year(s) Of Engagement Activity 2019
 
Description New Scientist Live - Gut Health: Revealing the Power of the Microbiome event talk titled "Blueprint of the human gut microbiota" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Public/other audiences
Results and Impact In this talk Alexandre Almeida described the extraordinary detective work involved and how a blueprint of the human gut could help us understand human health and diseases better.
Year(s) Of Engagement Activity 2019
URL https://www.list.co.uk/event/1438099-gut-health-revealing-the-power-of-the-microbiome/
 
Description Popular Science article titled "Scientists think they've found 1,952 new species living in our poop" 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Research work from the group that utilised MGnify to reveal the complexity and diversity of the human gut microbiome resulted in a 2019 Nature publication titled "A new genomic blueprint of the human gut microbiota" [https://doi.org/10.1038/s41586-019-0965-1]. This work was then featured in the Popular Science article.
Year(s) Of Engagement Activity 2019
URL https://www.popsci.com/gut-microbiome-new-bacteria/
 
Description Swiss Academy of Sciences talk "Mining the novelty from metagenomic sequencing" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact In this talk to the Swiss Academy of Sciences, PI Dr Rob Finn presented the services and research offered by his resource MGnify and how they can be used by European researchers.
Year(s) Of Engagement Activity 2020
 
Description University of Warwick Seminar series talk "Broadening our genomic knowledge of human microbiomes" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Professional Practitioners
Results and Impact Life Sciences seminar by Dr Rob Finn. He described his team's recent publication on the Unified Human Gastrointestinal Genome (UHGG) catalogue, which is an unprecedented collection of nearly 5,000 species found in the gut microbiome, with 70% yet to be cultured. Dr Finn provided further details on the team's recent efforts to recover genomes from the human skin microbiome, which not only harbours a very distinct microbial composition compared to the gut, but also carries additional challenges such as low DNA yield. For both microbiomes, the team is currently investigating the microbiota beyond bacteria. An overview of these results was presented, assessing the challenges faced when researchers try to understand microbial community structures.
Year(s) Of Engagement Activity 2020
URL https://warwick.ac.uk/insite/events/events?calendarItem=8a17841b75f501d70175f549d9980163