EMERALD - Enriching MEtagenomics Results using Artificial intelligence and Literature Data

Lead Research Organisation: EMBL - European Bioinformatics Institute
Department Name: Sequence Database Group


Microbes such as bacteria and fungi inhabit diverse environments, including soil, water, and human body sites such as the mouth, skin and intestine. Ubiquitous in nature, they also show adaptation to extreme environments, such as acid mine drainage or hydrothermal vents. We have appreciated the potential of microbes for a long time - they are important for food and beverage manufacturing (e.g. cheese and beer), and are key players in bioremediation, as demonstrated by their pivotal role in breaking down complex oils following the Deepwater Horizon oil spill in the Gulf of Mexico. The field of metagenomics offers an exciting opportunity to examine these microbial communities and gain insights into various aspects of their existence, e.g. their interaction with humans and plants, their potential as disease reservoirs, and their potential as sources of novel enzymes with bioremediation or plastic recycling abilities.

Metagenomics studies microbial communities by sampling the environments directly, extracting and sequencing their genetic material (DNA), and applying computational methods to elucidate microbial composition and function. This sampling approach helps to characterise microbes that cannot be, or have not yet been, cultured in the laboratory. Metagenomics experimental data are typically large (10-100s of GBs per sequencing run; 100s of runs per project), complex (comprising 100-1000s of different microbes) and variable due to the nature of the underlying experiments and (sub-)sampling of the dynamic populations.

Although microbial communities are known to fluctuate (e.g. with time of year or day), metagenomic datasets typically contain poor descriptions (termed metadata) relating to the sample origin or the methods used to obtain the DNA and process the sequence data. To help interpret data across experiments and derive meaningful biological conclusions, it is crucial to know whether a difference between two metagenomics datasets is due to differences in underlying experimental techniques or the biological qualities of the sample. The lack of metadata has impeded our attempts to apply machine learning (ML) techniques to interpret new incoming data, and therefore our capacity to find novel biological applications.

To circumvent these issues, our proposal aims to employ different ML methodologies to enrich the currently available metadata and start elucidating new knowledge embedded in the sequence data. The text mining approach will focus on identifying research articles on metagenomics experiments to unearth and extract detailed descriptions which will be used to enrich the metadata associated with the corresponding DNA sequences and generate new or improved classification systems. This dictionary of descriptor terms will also serve as the template for developing methods to discover previously unidentified metagenomics papers. We will train algorithms on this enriched metadata to progressively learn what criteria might be applied to incoming data with inadequate descriptions in order to determine sample origin, processing, as well as decipher which experimental biases affect the results, when comparing similar samples.

ML approaches will also be used for the discovery of new biological functions. Bacteria encode gene cassettes that are responsible for producing compounds of pharmaceutical and agricultural value. Functional descriptions for the genes constituting these cassettes are incomplete, while many cassettes still await discovery. By combining the ML and text mining approaches, we intend to better describe these cassettes and also focus on the detection of novel groups.

Data underpinning this work will originate from key EMBL-EBI databases, namely EBI Metagenomics and Europe PMC, as well as other resources (e.g. MIBiG). The developments proposed here will help resolve complexities underlying experimental data, enriching the metadata in the process and laying the foundation for a new generation of reliable predictive models.

Technical Summary

The field of metagenomics is burgeoning as the technique furnishes insights into the sum total of all microbial content within particular biomes. Technological advances in sequencing methods have resulted in a data deluge - while this has afforded us access to hitherto rare microbes, the analysis is often complicated by inconsistencies in data sampling, lack of metadata specificity, data variability for identical biomes and the choice of downstream analysis tools. As datasets from metagenomics experiments are inherently noisy, detecting significant and explicit biological signals becomes challenging. Comparison across similar datasets would help detect meaningful signals, but the paucity of standardized contextual metadata, associated literature and granularity in labelling makes this difficult.

To overcome these issues, we will apply text mining (NLP) and machine learning (ML) methodologies to enrich and standardize metadata, improve functional annotations, and enhance discovery of novel secondary metabolite gene clusters (SMGC). We will identify metadata-linked terms already present in the EBI Metagenomics portal (EMG) and in full text publications in Europe PMC to develop training sets that will facilitate NLP/ML approaches for finding additional metadata. We will apply ML algorithms based on metagenomics datasets to determine biome-specific tags, enrich metadata and identify outlying datasets. Using a combination of EMG-linked data, biological relationships, and literature, we will also develop ML models that incorporate the complex rules behind cluster evolution and metabolite production. We will enhance existing SMGC descriptions via NLP approaches and use them to develop training sets for the detection of SMGCs. These ML SMGC models will then be applied to assembled metagenomics contigs to find novel SMGCs. We will also investigate clustering tools to enhance our ability to discover novel clusters based on the EMG protein sequence database.
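One simple way to picture the "identify outlying datasets" step is to compare taxonomic abundance profiles pairwise and flag samples that sit far from all the others. The sketch below is illustrative only and is not the project's method: it substitutes a plain Bray-Curtis dissimilarity for the ML models described above, and the function names, the profile layout (taxon name mapped to read count) and the 0.8 threshold are all assumptions made for the example.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance profiles.

    Each profile is a dict mapping a taxon name to a non-negative count.
    Returns 0.0 for identical profiles and 1.0 when no taxa are shared.
    """
    taxa = set(a) | set(b)
    shared = sum(min(a.get(t, 0), b.get(t, 0)) for t in taxa)
    total = sum(a.values()) + sum(b.values())
    return 1.0 - 2.0 * shared / total if total else 0.0


def mean_dissimilarity(profiles):
    """Mean Bray-Curtis dissimilarity of each sample to every other sample."""
    result = {}
    for name, profile in profiles.items():
        others = [
            bray_curtis(profile, other)
            for other_name, other in profiles.items()
            if other_name != name
        ]
        result[name] = sum(others) / len(others) if others else 0.0
    return result


def flag_outliers(profiles, threshold=0.8):
    """Return sample names whose mean dissimilarity exceeds the threshold.

    The default threshold is a hypothetical value chosen for illustration.
    """
    scores = mean_dissimilarity(profiles)
    return sorted(name for name, score in scores.items() if score > threshold)
```

Given several samples labelled with the same biome, a sample returned by `flag_outliers` would be a candidate for a metadata check, since its composition differs from that of its supposed peers.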

Planned Impact

Metagenomics is a rapidly expanding field wherein the depth and breadth of data are constantly increasing. Consequently, the number of published research articles associated with the field is also growing. However, there is often a disconnect between sample, sequence data and publication. The lack of data integration has hampered the production of statistically robust, predictive models. Moreover, datasets from different groups are rarely compared, partly because experimental approaches for investigating different microbiomes are constantly evolving.

In this proposal, we plan to adopt the use of machine learning (ML) algorithms and natural language processing (NLP) to help overcome these challenges by improving metadata and developing predictions based on taxonomic and functional assignments contained within EBI metagenomics (EMG), enhanced by linking to the primary literature in Europe PMC. We will also focus on the use of both ML and NLP to enhance our ability to discover novel microbial secondary metabolite gene clusters (SMGCs) in our metagenomics assemblies. SMGCs are responsible for the production of key products, like antimicrobials and insecticides, both of great agricultural and biotechnological importance, as well as impacting human health.

Due to the widespread use of metagenomics and the position of EMG and Europe PMC, we anticipate the impact of this research to be significant. Metagenomics is widespread in research projects associated with BBSRC strategic priorities (agriculture and food security, industrial biotechnology and bioscience for health); the field represents the epitome of data driven biology. Through the application of NLP and ML, we will demonstrate how these new technologies can be utilised to help research scientists interrogate big data. Whilst we will be domain focused, the technical developments within this project will have far reaching impacts, applicable to other fields and analytical disciplines. The 'use cases' in the program will cover a range of cross-cutting themes, demonstrating the general applicability of the techniques to different environments and conditions. Furthermore, the semantically marked up literature, enriched metadata and SMGC annotations will have applications in a wide range of academic and industrial fields, including enzyme discovery, environmental science, diagnostics and animal/human health.

We will ensure impact on all academic and industrial audiences by the publication of software, data, compute containers and peer reviewed articles. To address the skills shortages in the fields of metagenomics, NLP and ML, we will deliver training, webinars and participate in community workshops. Other dissemination routes include the use of networks and collaborations, conferences and social media channels. The public sector will also be engaged, via specific events and through the publication of non-specialist articles and interviews.

The outputs of the project will be of exceptional value to the commercial sector, and the benefits will eventually feed through to the public. The software and its applications will lead to new discoveries such as new antibiotics for humans and livestock, higher agricultural yields from the understanding of socio-ecological interplay (e.g. food chain microbes) and expanded discovery of novel enzymes capable of operating at extremes, such as psychrophilic enzymes for detergents, or with novel catalytic functionality (e.g. anaerobic digestion pathways in biofuel production).

Combining literature and metagenomic data as in this proposal is pivotal to the notion of One Health: the collaborative effort of multiple disciplines working at national and international levels to attain optimal health for people, animals and the environment. Our proposal encapsulates this philosophy and will impact major UK and international communities, ensuring that the potential of metagenomics data is collectively realised.


Title Development of profile HMM library for the detection of secondary metabolites. 
Description We have been identifying proteins found in secondary metabolite gene clusters that lack a Pfam annotation. We clustered these unannotated proteins to identify missing protein families. These families have then been developed into profile hidden Markov models, which have been added to the Pfam database. We have also identified cases where Pfam domains were missing matches and improved the existing models.
Type Of Technology New/Improved Technique/Technology 
Year Produced 2020 
Impact This has increased the coverage of proteins found in secondary metabolite gene clusters, enabling better training of machine learning algorithms to distinguish clusters from other coding regions of bacteria. This remains a work in progress, but we have significantly increased the coverage of bacteriocins. While this data product
URL http://pfam.xfam.org
Title Microbiome classifier and Named-Entity Recognition 
Description Using literature-based machine learning (ML) approaches, we have been identifying key metagenomics papers, as well as their related biomes, experimental factors, and secondary metabolite gene clusters (SMGCs). To identify key metagenomics papers, we developed supervised biome classifiers that assign publications linked to ENA metagenomics studies to host-associated, environmental and engineered categories, as well as to marine and human faecal microbiomes. We generated the training dataset by linking curated metagenomics samples in MGnify with their corresponding publications and trained several Random Forest models to predict diverse microbiomes. To recognise metagenomics metadata, we identified bags of words representing other biome types, experimental factors and secondary metabolite gene clusters, and used their contextual representations (word embeddings), generated by unsupervised neural network training, to identify further biome-related metadata.
Type Of Technology New/Improved Technique/Technology 
Year Produced 2019 
Impact Classifying and identifying metagenomics publications that cover a wide variety of microbiomes helped in creating a representative set of metagenomics triage papers for biocuration and for the subsequent training and refining of machine learning models.
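
The bag-of-words idea in the description above can be illustrated with a minimal dictionary-matching sketch. This is not the project's classifier: the Random Forest models and word embeddings are replaced here by simple term counting, and the `BIOME_TERMS` vocabulary, the biome labels and the function names are hypothetical values chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical descriptor dictionary: biome label -> indicative terms.
# Illustrative only; not the project's curated vocabulary.
BIOME_TERMS = {
    "marine": {"seawater", "ocean", "marine", "coastal"},
    "human faecal": {"faecal", "fecal", "stool", "gut"},
    "soil": {"soil", "rhizosphere", "permafrost"},
}


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def score_biomes(text):
    """Count how many tokens in the text match each biome's term set."""
    counts = Counter(tokenize(text))
    return {
        biome: sum(counts[term] for term in terms)
        for biome, terms in BIOME_TERMS.items()
    }


def predict_biome(text):
    """Return the best-scoring biome, or None when no term matches."""
    scores = score_biomes(text)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Applied to an abstract or methods section, `predict_biome` gives a crude first-pass label; a supervised model trained on curated sample-publication links, as described above, would be expected to generalise beyond exact term matches.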