Micro-organisms are found in virtually all environments. Typically, they form the base of the food chain (such as plankton in the sea) and play essential roles in their ecosystems. There is often a complex interplay between different micro-organisms, with some organisms requiring that others be present in order for them to exist. An imbalance within a community can have severe effects, such as disease in the human gut, or the inability of plants to grow efficiently in soil. An understanding of the composition and interplay within the communities allows us to potentially manipulate them. Thus, there is intense research into micro-organism communities in many different fields, such as improving livestock yields, aiding recovery from bacterial infections using fecal transplants and producing biofuels efficiently. Many of these communities also contain important proteins that could be useful to the biotechnological and pharmaceutical industries, such as enzymes involved in the production of antibiotics.
Metagenomics is the study of these different micro-organism communities. It is achieved by isolating the DNA from the organisms within an environmental sample (e.g. water, soil, animal stool), sequencing the DNA, and then analysing it computationally to decode which organisms are present and the functions they might be performing. This computation is complicated: (1) there is a huge amount of data; (2) the sequence data is a jumbled mix of fragments from different organisms; (3) decoding the DNA is hard - typically >90% of organisms within a sample are not well characterised.
This proposal brings together three major resources within the field of metagenomics data archiving and analysis. The European Nucleotide Archive (ENA) is a repository of DNA sequence data. Importantly, ENA also captures metagenomic contextual data, such as where and when the sample was taken, and how the DNA was extracted and sequenced. The EBI metagenomics portal (EMG, UK) and MG-RAST (MGR, US) are two metagenomics sequence analysis platforms. They are the only free-to-use services to which researchers can upload sequence data and have it analysed without restriction. Despite the widespread use of metagenomics, the community currently lacks standards to ensure that metagenomics sequence data and the derived functional and taxonomic information are deposited within a database of record. Consequently, navigating between metagenomics datasets is very difficult, even for experienced users. As EMG and MGR offer slightly different, yet complementary, analysis services, there is often a desire to have a metagenomics dataset analysed by both resources. However, the number of equivalent datasets between the two resources is unknown. Unless a user has prior knowledge about equivalent projects, they remain disconnected. Also, sequence data submitted to MGR may not necessarily be deposited in ENA. We propose to set up a computational framework, termed Metagenomics Exchange (ME), to enable metagenomics datasets and the results of their analysis to be linked. All sequences will become available to the research community via ENA and analysis results will be automatically exchanged between EMG and MGR. The ME will be implemented to enable other metagenomics analysis providers to join, and so that it can be used by researchers wishing to perform large scale analyses. We will also investigate ways that our own pipelines can be enhanced through the use of the ME, for example by sharing software and processing tasks.
This will lead to computational savings, increasing the capacity for metagenomics analysis. We will also generate a knowledge transfer forum, enabling the exchange of ideas on a range of topics, from hardware solutions to algorithms. Finally, we will undertake a research program to investigate the optimal combination of pipeline analysis components, and whether a single, unified analysis pipeline could be engineered.

Technical Summary

Metagenomics is a widely used approach to investigate the composition of microbial communities. With the development of modern sequencing platforms, the bottleneck is rarely the generation of sequence data, but rather its analysis. MG-RAST (MGR) and EBI Metagenomics (EMG) are the two world-leading metagenomics analysis platforms. These analysis platforms employ distinct, yet complementary, approaches for the functional characterisation of metagenomic sequences. However, their pipelines closely align in the early stages of analysis, such as quality control. Unlike other data types, there is no mandate for researchers to submit metagenomics data to an analysis platform. Furthermore, resources such as MGR are not linked to an INSDC member, such as the European Nucleotide Archive (ENA). Currently, metagenomics sequence data, associated contextual metadata and derived functional and taxonomic assignments are disjointed within the field. Consequently, it is virtually impossible to navigate these cumbersome datasets. We propose to solve this problem by developing a 'Metagenomics Exchange' (ME), which builds upon ENA technologies, to provide a registry of metagenomics datasets. MGR and EMG will use this registry to discover new datasets and publish their derived annotations, using tools and RESTful APIs to push/pull information from the registry. With the ME in place, we will populate it with existing datasets - developing the tools necessary to identify equivalent datasets. MGR and EMG will standardise on common analysis components and utilise the ME to enable crosstalk between pipelines, reducing computational overhead. The two teams will also exchange technology knowledge, such as data storage solutions and pipeline containerization. The websites will be harmonised to seamlessly present federated analysis results from both platforms, thereby enriching interpretation. We will investigate optimal pipeline solutions that may pave the way for a unified pipeline.

Planned Impact

The use of metagenomics is widespread, with its application in diverse fields, e.g. agriculture, food manufacture, the elucidation of both antibiotic products and antibiotic resistance mechanisms, bioenergy, crop yields and animal/human health. Consequently, metagenomics data continues to grow exponentially, with ever increasing demands on community analysis services. As yet, the field lacks systematic co-ordination and organisation of sequence data and derived functional and taxonomic information. We propose to solve this through the development of the Metagenomics Exchange (ME), which will primarily address the key area of data driven bioscience, but also have significant influences on many of the strategic priorities for the BBSRC and NSF.
The impact of both the EBI metagenomics (EMG) and MG-RAST (MGR) analysis platforms on academic research is already in effect. Both provide robust, specialised analyses and access to significant amounts of compute (~55 million CPU hours/year). The ME will catalogue information about different metagenomic datasets and their analyses, enabling users from both academic and industrial sectors to rapidly discover them. Moreover, EMG and MGR will collect and present results from each other's platform, ensuring that a user is presented with all available analyses (saving user time/effort). To reduce duplication and to minimise differences, EMG and MGR will standardise on common parts of their pipelines. This will improve consistency and, as the project matures, allow crosstalk between the analysis pipelines. Crosstalk will also reduce computational overhead, allowing greater throughput for the community. The EMG and MGR websites collectively have 100,000s of individual visitors per year. Steps to harmonise the websites will improve user experience for both new and existing users.
Our objective in improving data discoverability via the ME is to allow metagenomics results to reach a broader life science community, where individuals may otherwise be unaware of the data. It is also important to note that, in this project, we are establishing a new collaboration, enabling MGR and EMG to become more aligned. Knowledge transfer between the groups will expand both UK and US skills in high throughput bioinformatics analysis.
The staff employed on this grant will receive hands-on training from members of the Finn, Cochrane and Meyer teams. All the institutes have excellent training schemes and career development courses and the staff will be working in world class laboratories of internationally renowned scientists. They will have opportunities to present their work within the groups, between the groups and at international conferences. Both technical developments and research findings will be presented at conferences and published in peer-reviewed journals. Information about all the resources, especially the new ME, will be disseminated to the community via peer-reviewed journals, conference presentations, a specialist workshop, and online training materials. We will also engage with the non-specialist and public domains via non-scientific literature, social media (blogs and tweets) and by attending meetings aimed at a range of audiences. These activities will maximize dissemination into the academic, industrial and 3rd-party communities.
MGR and EMG will leverage their links to the industrial sectors to ensure that these sectors' needs are met. Indeed, the biotechnology industry may benefit the most from the implementation of the ME, as it is frequently engaged in identifying catalytic activities across multiple datasets. The ME will enhance the translation of metagenomics research to industrial applications. In the longer term, the knowledge gained from understanding complex communities will have significant impacts for the UK, US and world economies, ranging from more efficient industrial enzymes, through improved soil conditions and crop yields, to healthcare solutions derived from comparing diseased and healthy states.
Description

We have designed, implemented, and released the infrastructure for housing the Metagenomics Exchange (ME) data. At the core of the ME is a simple registry that captures whether a particular resource has either a sequence dataset, or a set of analysis results for a dataset. Both resource providers and end users interact with the ME Registry through the API. MGnify has currently analysed 75% of the MG-RAST brokered datasets, paving the way for exchanging analysis results between the two sites. As MGnify has introduced assembly, the ME Registry has been extended to handle assembled sequence data accessions as well as the original sequencing project. No sequence data is held within the registry; rather, it holds the accessioned mappings between two "equivalent" sequence sets found in each of the two broker resources (ENA and MG-RAST). Identified sequence set accessions can then be used to query the ME Registry to provide access to the locations of the respective sequence sets and associated analysis results. The ME Registry consists of two types of API - an administration interface and a public read-only interface. The administration interface allows resource providers to register and manage their datasets for exchange, while the public read-only interface allows users to find and query the identified runs for mappings to metagenomics datasets. Authorisation for the registry is performed using access tokens. Each resource provider group (MG-RAST, MGnify and ENA) has its own token, which is required for all administration tasks (submit, update, delete), as well as for read access to pre-publication data.
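
The registry model described above can be illustrated with a minimal in-memory sketch. This is not the ME Registry implementation: the class name, method names, token values and accession strings are all hypothetical, and the sketch only shows the split between token-protected administration calls and an open read-only lookup over accession mappings.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MERegistrySketch:
    """Toy stand-in for a registry of equivalent sequence set accessions."""

    # Hypothetical provider -> access token table (illustration only).
    tokens: dict[str, str]
    # Accession -> accessions registered as equivalent to it. No sequence data is stored.
    _equivalents: dict[str, set[str]] = field(default_factory=dict, init=False, repr=False)

    def _provider_for(self, token: str) -> str:
        """Return the provider that owns the token, or refuse the request."""
        for provider, known in self.tokens.items():
            if known == token:
                return provider
        raise PermissionError("unknown access token")

    def register_mapping(self, token: str, accession_a: str, accession_b: str) -> str:
        """Administration call: record that two sequence sets are equivalent."""
        provider = self._provider_for(token)
        self._equivalents.setdefault(accession_a, set()).add(accession_b)
        self._equivalents.setdefault(accession_b, set()).add(accession_a)
        return provider

    def delete_mapping(self, token: str, accession_a: str, accession_b: str) -> None:
        """Administration call: remove a previously registered mapping."""
        self._provider_for(token)
        self._equivalents.get(accession_a, set()).discard(accession_b)
        self._equivalents.get(accession_b, set()).discard(accession_a)

    def lookup(self, accession: str) -> list[str]:
        """Public read-only call: list accessions equivalent to the given one."""
        return sorted(self._equivalents.get(accession, set()))
```

In this sketch a provider would call `register_mapping("its-token", "ENA_RUN_0001", "MGR_SET_0001")` (made-up accessions), after which any user could call `lookup("MGR_SET_0001")` without a token.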

Determining which sequence datasets are equivalent in MG-RAST and ENA has proved to be challenging. One of the main issues is that MG-RAST does not store the original raw FASTQ files, but rather stores the quality controlled FASTA files. Thus, retrospective population of the ME registry was less straightforward than originally anticipated. Where metadata matching alone cannot establish equivalence, the tools that make this activity possible are only now becoming available. We have extended the ME registry to include the method(s) used to infer equivalence: hash_of_sequence, kmer_profile, taxonomy_signature, functional_signature, gps_coordinates, biome, other_metadata. Confidence in the results will be provided and is defined as full if the sequence hashes match, high if the biome and GPS match, medium for a good combination of other fields and low for uncertain matches. While many of the older datasets will not be mapped, part of the work focuses on enabling the brokering of MG-RAST sequence datasets into ENA. This will enable the direct capture of equivalence between MG-RAST and ENA.
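
The confidence levels above can be sketched as a small decision function. The method names are those listed in the text; the function name is hypothetical, and the reading of "a good combination of other fields" as "at least two matching methods" is an assumption made purely for illustration, not the rule used by the registry.

```python
from __future__ import annotations

# Equivalence methods as listed in the text.
KNOWN_METHODS = {
    "hash_of_sequence",
    "kmer_profile",
    "taxonomy_signature",
    "functional_signature",
    "gps_coordinates",
    "biome",
    "other_metadata",
}


def equivalence_confidence(matched_methods: set[str]) -> str:
    """Map the set of methods that agreed to a confidence label."""
    unknown = matched_methods - KNOWN_METHODS
    if unknown:
        raise ValueError(f"unknown equivalence method(s): {sorted(unknown)}")
    if not matched_methods:
        raise ValueError("at least one matching method is required")

    # Identical sequence hashes are the strongest evidence.
    if "hash_of_sequence" in matched_methods:
        return "full"
    # Biome and GPS coordinates both agree.
    if {"biome", "gps_coordinates"} <= matched_methods:
        return "high"
    # Assumption: two or more agreeing methods count as "a good combination".
    if len(matched_methods) >= 2:
        return "medium"
    # A single weak signal is treated as an uncertain match.
    return "low"
```

Under these assumptions, a pair matched only by `kmer_profile` and `taxonomy_signature` would be labelled medium, while a pair matched only by `biome` would be labelled low.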

Both MGnify and MG-RAST have adopted the Common Workflow Language (CWL) to describe their analysis pipelines in a standard fashion. To achieve this, the MG-RAST execution framework (AWE) has been extended to be able to execute CWL pipelines. The MGnify analysis pipelines have been described in CWL since version 3, and we are now at version 5. We have also described the MGnify assembly pipeline in CWL. The latest version of the MGnify analysis pipeline, i.e. version 5.0, has been re-worked to produce three distinct versions (amplicon, raw read, and assembly analysis). These have been rapidly built using common components (subworkflows) where appropriate. We have also successfully evaluated different execution engines for running the CWL on different compute infrastructures. Both MG-RAST and MGnify teams are promoting the use of CWL as part of the Genomic Standards Consortium, as well as through contributions in commentaries. Describing pipelines in a standard format allows complete provenance of the pipeline (e.g. allowing reproducible science), simpler comparison between the two, and rapid rebuilding and combining of components.

To enrich search and retrieval of data from MGnify, we have developed and released a RESTful API, providing programmatic access to all of the data contained within the resource. The base address of the API gives access to several collections of resources, such as studies, samples, runs, experiment-types, biomes and annotations. Combined with appropriate relationships to other resources, these can be filtered and sorted by selected attributes, allowing complex queries to be constructed (for example: 'retrieve all oceanographic samples from metagenomic studies taken at temperatures less than 10 °C'). The provision of such complex queries allows metadata to be combined with annotation for powerful data analysis and visualisation. We have utilised an interactive documentation framework (Swagger UI) to visualise and simplify interaction with the API's resources via an HTML interface, allowing less experienced users to interactively build up API queries. Detailed explanations of the purpose of all resources, along with many examples, are also provided to guide end-users. This, in combination with the MG-RAST API, provides the underlying mechanisms for data exchange between the resources and for disseminating the results. The MGnify API has since seen a substantial increase in usage, receiving millions of requests per month.
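
The example query in the paragraph above can be pictured as a collection URL plus filter and ordering parameters. The sketch below only builds such a URL and sends no request. The base address is a placeholder, and the parameter names (`biome_name`, `experiment_type`, `metadata_key`, `metadata_value_lte`, `ordering`) are illustrative assumptions that should be checked against the interactive API documentation before use.

```python
from __future__ import annotations

from urllib.parse import urlencode

# Placeholder base address, not the real service URL.
BASE_URL = "https://api.example.org/v1"


def build_query(collection: str, filters: dict[str, object], ordering: str | None = None) -> str:
    """Compose a filtered, optionally sorted, collection URL."""
    params = dict(filters)
    if ordering is not None:
        params["ordering"] = ordering
    # Sort parameters so the same query always yields the same URL.
    query = urlencode(sorted(params.items()))
    url = f"{BASE_URL}/{collection}"
    return f"{url}?{query}" if query else url


# 'Oceanographic samples from metagenomic studies taken below 10 degrees C',
# expressed with hypothetical filter names.
cold_marine_samples = build_query(
    "samples",
    {
        "biome_name": "marine",
        "experiment_type": "metagenomic",
        "metadata_key": "temperature",
        "metadata_value_lte": 10,
    },
)
```

Keeping the filters in a plain dictionary makes it straightforward to combine metadata constraints programmatically before handing the URL to an HTTP client.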

First developed in 2010, the former MGnify website was not designed with modern API approaches in mind, and adopted the now antiquated design of the server directly contacting the backend database. Exposing new data types and pulling in data via the MG-RAST API would therefore have been extremely time consuming. To overcome this, we have completely rewritten the MGnify site to consume the new MGnify API (thereby reducing duplication of effort). Furthermore, the website was rebuilt in a modern framework; this included the development of a portable JavaScript library to consume the MGnify API (implemented in Backbone.js). This library may be released via a public package repository in the future and will be shared with MG-RAST to enable them to consume and display MGnify outputs with minimal effort.

During the course of this project, MGnify has added metagenomic assembly as another component of the pipeline repertoire. We have each shared our experiences with metagenomics assembly, especially in terms of the performance of different algorithms and the quality of assemblies. The MGnify team has showcased their neural network for assembly parameter estimation. We have also exchanged ideas on API design and the benefits (and drawbacks) of using standards/best practices. As part of this work, we have had a technical article on API provision published in PLoS Comp Biol. We have also looked for consistencies between our API endpoints and have a clear understanding of each other's APIs (and infrastructures). For example, MG-RAST has a Cassandra backend, whereas MGnify is backed by MongoDB. While we both adopt NoSQL solutions, Cassandra offers greater search functionality compared to MGnify's current MongoDB system. This limited search is being overcome by releasing software solutions that enable equivalent searches by combining API queries (e.g. the Metagenomics Tool Kit). Furthermore, we have had specific meetings describing the containerisation of our workflows. We have also exchanged ideas surrounding the use of Simka for k-mer profiling of datasets. Because this algorithm removes low-abundance k-mers (which can arise from small variations introduced by quality control), it has been possible to match imperfect datasets. However, we have not been able to scale the update procedure for this approach and are currently investigating solutions and alternatives.
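
The idea of matching imperfect datasets by discarding low-abundance k-mers can be illustrated with a toy comparison. This is not Simka and does not reproduce its algorithm or distance measures: the function names, the k-mer length, the abundance threshold and the use of Jaccard similarity are all assumptions chosen to keep the sketch small.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable


def kmer_profile(sequences: Iterable[str], k: int = 4, min_count: int = 2) -> set[str]:
    """Return the k-mers seen at least `min_count` times across the sequences."""
    counts: Counter[str] = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    # Dropping rare k-mers hides small differences, e.g. from read trimming.
    return {kmer for kmer, n in counts.items() if n >= min_count}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two k-mer sets (1.0 means identical sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# The same read, once as stored and once with two extra trailing bases left in.
stored = ["ACGTACGTACGT"]
untrimmed = ["ACGTACGTACGTTT"]

with_filter = jaccard(kmer_profile(stored), kmer_profile(untrimmed))
without_filter = jaccard(
    kmer_profile(stored, min_count=1), kmer_profile(untrimmed, min_count=1)
)
```

Here the two extra bases contribute only singleton k-mers, so the filtered profiles are identical while the unfiltered profiles differ; real datasets would need far larger k and a scalable counting strategy.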

We have reviewed the respective steps in our pipelines to identify commonalities and where a common solution may prove beneficial. The initial comparisons of the pipelines have indicated that the highest degree of overlap resides in the initial quality control and trimming sections. We also strongly believe that our independent approaches to functional annotation are complementary. MG-RAST provides the best match to a sequence using sequence similarity searches against a large sequence database, while MGnify provides matches to different protein family databases. As many sequences lack functional annotation, the domain annotations can be more informative. On the other hand, the presence of certain domains does not always describe the overarching function of a sequence, whereas a full-length match to an annotated sequence would. To overcome these limitations, MGnify has adopted DIAMOND searches against UniRef90 for the annotation of its assemblies. Similarly, MGnify also includes KEGG and antiSMASH annotations, which provide access to higher order annotations. Since the commencement of this project, MGnify has moved to offering assembly as a service, a capacity that MG-RAST does not yet offer. We are sharing our workflow descriptions for this process, putting CWL into practice to achieve these outcomes.
Exploitation Route

Although the Metagenomics Exchange (ME) was originally developed with MG-RAST and MGnify, the model is completely agnostic about analysis source. The only restriction is that the underlying sequence data that the analysis is based upon is found within the ENA (submitted directly or to one of the INSDC partners). This means that other metagenomics analysis resources, such as IMG/M and iMicrobe, could also use the ME to expose their analysis results, making them discoverable for other research scientists.

With the current systems, we will make it simple for research scientists to know when a common dataset has been analysed in both resources. As both resources have different analysis strategies, they may highlight different features in the dataset, accelerating the rate of novel discovery. Moreover, when the results are consistent, it provides independent validation of the results.

The new MGnify website now provides a more consistent view of the data plus the associated API, providing access to the terabytes of processed data. This API is accompanied by software libraries that both illustrate the use of the API using standard libraries and can be used to access the data. At the time of writing, these libraries have been downloaded over 25,000 times.

The CWL descriptions of our pipelines allow for complete provenance of the analysis, increasing transparency of how the results were derived and how two pipelines may differ, allowing scientists to account for the differences that arise from informatics variation. Furthermore, these CWL descriptions can be taken and extended or modified (e.g. inclusion of new tools or reference databases). Our use of CWL is also driving the development of third-party execution frameworks, e.g. Toil and CWLEXEC. As CWL is not confined to biology, it potentially has a very broad impact.
Description

MGnify pipelines have been encapsulated using the Common Workflow Language (CWL). Examples are now available in the WorkflowHub. These pipelines have been used as exemplars for a wide range of communities (industrial and academic) on how to develop complex computational analysis pipelines, and how they can be used to capture the provenance of the analysis (software tools, parameters, reference databases, inputs and outputs). Since the completion of the project, new avenues have opened up for expanding the analyses that are available in MGnify, and the work has facilitated the prototyping of Research Object crates (RO-crates) for sharing data with complete provenance, thereby improving open science and reproducibility. This has also sparked alternative ways of working, and provided a potential mechanism for federating workloads associated with MGnify analysis services. Furthermore, formally establishing this foundation has opened up new opportunities to collaborate with the scientific community, with MGnify pipelines forming the basis of a new metagenomic analysis pipeline that has been developed and deployed on the European Open Science Cloud (EOSC) compute infrastructure. This pipeline, MetaGOflow, utilises a subset of the MGnify pipeline components, rather than the comprehensive set of functional and taxonomic analyses, to provide an indicator of function. The MetaGOflow workflow is currently being used by the European Marine Genomic Observatories as part of their oceanographic surveillance. During the course of the project we also contributed to the open access software for executing these workflows, namely Toil, which is also used across sectors, including physics and geography. Due to continued issues faced with Toil, though, we have subsequently transitioned to Nextflow. However, this work paved the way for the wider use of workflows to encapsulate complex bioinformatics analyses.
The taxonomic identification data generated in MGnify is now flowing into the Global Biodiversity Information Facility (GBIF), which has a completely different audience to either MGnify or MG-RAST. This data presents a new view on environmental biodiversity and provides the ability to connect biodiversity data with museum collections. The data in GBIF is also used to inform governmental policies across the world, with MGnify representing one of the largest contributors of taxonomic identifications. For instance, GBIF has imported 970 MGnify studies from environmental biomes, and the data from MGnify has been used in 475 different citations (which are attributed to GBIF, not MGnify, in the literature). The Metagenomics Exchange framework has been implemented to enable the discovery of different sources of metagenomics analysis. MGnify has deposited thousands of analysis results, improving the FAIRification of data. Since the project, additional modifications have been requested that will accommodate MGnify's shift to metagenomic assemblies. Sequence assemblies can be treated as either analysis objects or sequence records, but the original implementation only dealt with sequence records. We continue to engage different stakeholders to add their data into the Metagenomics Exchange. A publication is in preparation. Subsequent to this project, we had further discussions with MG-RAST on how to integrate some of their unique features within MGnify to enable us to meet some of the demands of the UK and wider research community. For example, we have been approached about having the SEED subsystem annotations in MGnify. This work facilitated an understanding of the MG-RAST pipeline, which enabled us to rapidly evaluate the cost-benefit of introducing additional analysis components to the MGnify systems. Currently, MG-RAST is no longer receiving any active support and its future is highly uncertain.
First Year Of Impact 2020
