EMERALD - Enriching MEtagenomics Results using Artificial intelligence and Literature Data

Lead Research Organisation: EMBL - European Bioinformatics Institute
Department Name: Genome Assembly and Annotation


Microbes like bacteria and fungi inhabit diverse environments, including soil, water, and human body sites, such as the mouth, skin and intestine. Ubiquitous in nature, they also show adaptation to extreme environments, such as acid mine drainage or hydrothermal vents. We have appreciated the potential of microbes for a long time - they are important for food and beverage manufacturing (e.g. cheese and beer), and are key players in bioremediation, as demonstrated by their pivotal role in breaking down complex oils following the Deepwater Horizon oil spill in the Gulf of Mexico. The field of metagenomics offers an exciting opportunity to examine these microbial communities and gain insights into various aspects of their existence, e.g. their interaction with humans and plants, their potential as disease reservoirs, and their value as sources of novel enzymes with bioremediation or plastic recycling abilities.

Metagenomics studies microbial communities by sampling the environments directly, extracting and sequencing their genetic material (DNA), and applying computational methods to elucidate microbial composition and function. This direct sampling approach helps to characterise microbes that cannot be, or have not yet been, cultured in the laboratory. Metagenomics experimental data are typically large (10-100s of GBs per sequencing run; 100s of runs per project), complex (comprising 100-1000s of different microbes) and variable due to the nature of the underlying experiments and (sub-)sampling of the dynamic populations.

Despite knowledge about fluxes within a microbial community (e.g. time of year or day), metagenomic datasets typically contain poor descriptions (termed metadata) relating to the sample origin or methods used to obtain the DNA and process the sequence data. To help interpret data across experiments and derive meaningful biological conclusions, it is crucial to know whether a difference between two metagenomics datasets is due to differences in underlying experimental techniques or the biological qualities of the sample. The lack of metadata has impeded our attempts to apply machine learning (ML) techniques to interpret new incoming data, and therefore our capacity to find novel biological applications.

To circumvent these issues, our proposal aims to employ different ML methodologies to enrich the currently available metadata and start elucidating new knowledge embedded in the sequence data. The text mining approach will focus on identifying research articles on metagenomics experiments to unearth and extract detailed descriptions which will be used to enrich the metadata associated with the corresponding DNA sequences and generate new or improved classification systems. This dictionary of descriptor terms will also serve as the template for developing methods to discover previously unidentified metagenomics papers. We will train algorithms on this enriched metadata to progressively learn what criteria might be applied to incoming data with inadequate descriptions in order to determine sample origin, processing, as well as decipher which experimental biases affect the results, when comparing similar samples.

ML approaches will also be used for the discovery of new biological functions. Bacteria encode gene cassettes that are responsible for producing compounds of pharmaceutical and agricultural value. Functional descriptions for the genes constituting these cassettes are incomplete, while many cassettes still await discovery. By combining the ML and text mining approaches, we intend to better describe these cassettes and also focus on the detection of novel groups.

Data underpinning this work will originate from key EMBL-EBI databases, namely EBI Metagenomics and Europe PMC, as well as other resources (e.g. MIBiG). The developments proposed here will help resolve complexities underlying experimental data, enriching the metadata in the process and laying the foundation for a new generation of reliable predictive models.

Technical Summary

The field of metagenomics is burgeoning as the technique furnishes insights into the sum total of all microbial content within particular biomes. Technological advances in sequencing methods have resulted in a data deluge - while this has afforded us access to hitherto rare microbes, the analysis is often complicated by inconsistencies in data sampling, lack of metadata specificity, data variability for identical biomes and the choice of downstream analysis tools. As datasets from metagenomics experiments are inherently noisy, detecting significant and explicit biological signals becomes challenging. Comparison across similar datasets would help detect meaningful signals, but the paucity of standardized contextual metadata, associated literature and granularity in labelling makes this difficult.

To overcome these issues, we will apply text mining (NLP) and machine learning (ML) methodologies to enrich and standardize metadata, improve functional annotations, and enhance discovery of novel secondary metabolite gene clusters (SMGC). We will identify metadata-linked terms already present in the EBI Metagenomics portal (EMG) and in full text publications in Europe PMC to develop training sets that will facilitate NLP/ML approaches for finding additional metadata. We will apply ML algorithms based on metagenomics datasets to determine biome-specific tags, enrich metadata and identify outlying datasets. Using a combination of EMG-linked data, biological relationships, and literature, we will also develop ML models that incorporate the complex rules behind cluster evolution and metabolite production. We will enhance existing SMGC descriptions via NLP approaches and use them to develop training sets for the detection of SMGCs. These ML SMGC models will then be applied to assembled metagenomics contigs to find novel SMGCs. We will also investigate clustering tools to enhance our ability to discover novel clusters based on the EMG protein sequence database.

Planned Impact

Metagenomics is a rapidly expanding field wherein the depth and breadth of data are constantly increasing. Consequently, the number of published research articles associated with the field is also growing. However, there is often a disconnect between sample, sequence data and publication. The lack of data integration has hampered the production of statistically robust, predictive models. Moreover, datasets from different groups are rarely compared, partly because experimental approaches for investigating different microbiomes are constantly evolving.

In this proposal, we plan to adopt the use of machine learning (ML) algorithms and natural language processing (NLP) to help overcome these challenges by improving metadata and developing predictions based on taxonomic and functional assignments contained within EBI metagenomics (EMG), enhanced by linking to the primary literature in Europe PMC. We will also focus on the use of both ML and NLP to enhance our ability to discover novel microbial secondary metabolite gene clusters (SMGCs) in our metagenomics assemblies. SMGCs are responsible for the production of key products, like antimicrobials and insecticides, both of great agricultural and biotechnological importance, as well as impacting human health.

Due to the widespread use of metagenomics and the position of EMG and Europe PMC, we anticipate the impact of this research to be significant. Metagenomics is widespread in research projects associated with BBSRC strategic priorities (agriculture and food security, industrial biotechnology and bioscience for health); the field represents the epitome of data-driven biology. Through the application of NLP and ML, we will demonstrate how these new technologies can be utilised to help research scientists interrogate big data. Whilst we will be domain focused, the technical developments within this project will have far-reaching impacts, applicable to other fields and analytical disciplines. The 'use cases' in the program will cover a range of cross-cutting themes, demonstrating the general applicability of the techniques to different environments and conditions. Furthermore, the semantically marked up literature, enriched metadata and SMGC annotations will have applications in a wide range of academic and industrial fields, including enzyme discovery, environmental science, diagnostics and animal/human health.

We will ensure impact on all academic and industrial audiences by the publication of software, data, compute containers and peer reviewed articles. To address the skills shortages in the fields of metagenomics, NLP and ML, we will deliver training and webinars and participate in community workshops. Other dissemination routes include the use of networks and collaborations, conferences and social media channels. The public sector will also be engaged, via specific events and through the publication of non-specialist articles and interviews.

The outputs of the project will be of exceptional value to the commercial sector, and the benefits will eventually feed through to the public. The software and the applications thereof will lead to new discoveries such as new antibiotics for humans and livestock, higher agricultural yields from the understanding of socio-ecological interplay (e.g. food chain microbes) and expanded discovery of novel enzymes capable of operating at extremes, such as psychrophilic enzymes for detergents, or with novel catalytic functionality (e.g. anaerobic digestion pathways in biofuel production).

Combining literature and metagenomic data as in this proposal is pivotal to the notion of One Health: the collaborative effort of multiple disciplines working at national and international levels to attain optimal health for people, animals and the environment. Our proposal encapsulates this philosophy and will impact major UK and international communities, ensuring that the potential of metagenomics data is collectively realised.
Description We have developed a machine learning framework to identify metagenomics sample and experimental metadata in publications describing the research. These metadata annotations have been made available via the Europe PMC Annotations platform and can be accessed both in Europe PMC (article view) and via the Annotations API. These metadata annotations encompass 16 novel entity types, such as host, ecoregion, engineered (environment), body site, place, kit, primer, sequencing (platform), state and treatment. This framework has enabled the extraction of metadata from 114,099 publications in Europe PMC, including 19,900 publications describing metagenomics studies in ENA and MGnify. The metadata annotations are also pulled into MGnify and displayed alongside user-submitted metadata. Such information is critical for understanding differences between metagenomics datasets, which could be caused by different experimental approaches and/or different states (e.g. healthy vs diseased).

In an additional activity, we have also mined literature to identify papers describing secondary metabolite biosynthetic gene clusters (BGC). Current databases that catalogue BGCs are incomplete, and rely on curators identifying papers. The approach has streamlined this activity, allowing the key pieces of information to be identified: the genome producing the natural product, the natural product itself, and the activity/mode of action of the natural product. These data have also been used by computational groups to validate BGC prediction tools.

In a parallel thread of work, we have developed a novel method for detecting and classifying secondary metabolite biosynthetic gene clusters (BGC) in bacterial genomes. The method is composed of three modules: (1) a preprocessing module that functionally annotates proteins using InterProScan, which increases the total number of annotated proteins compared to other detection methods; (2) a detection module consisting of an artificial neural network trained with a time-series loss function, which helps to overcome the class imbalance in BGC datasets; (3) a post-processing module that filters out low-probability detected regions and classifies the remaining regions into one of the seven categories used in the MIBiG database. The method has been implemented as a Python package called EMERALD, publicly available at https://github.com/Finn-Lab/emeraldBGC and on bioconda. A performance comparison with other methods (antiSMASH, deepBGC, and GECCO) on two datasets independent from the training dataset (a real genomic dataset and a synthetic dataset) revealed that EMERALD is the tool with the highest F2 score (i.e. it has a high recall without losing precision) on both datasets (0.77 and 0.87), and the highest F1 score (best overall performance) on the real genomic dataset (0.76). Evaluation on a synthetic metagenomic assembly showed that EMERALD has the best performance (F1=0.82 and F2=0.79). We used EMERALD to predict BGCs in a set of 24,408 metagenomic assemblies (from MGnify) spanning different biomes. This resulted in the identification of 1,144,466 putative BGCs from different classes. These results offer a unique resource of BGCs from uncultured microorganisms. We are in the process of finalising the manuscript of this work for publication.
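For readers unfamiliar with the F1 and F2 scores quoted above, the sketch below shows the standard F-beta definition, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where P is precision and R is recall. Setting beta = 2 weights recall more heavily than precision. The function names and the example counts are hypothetical and chosen only for illustration; they are not taken from the EMERALD code or from its benchmark results.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true-positive, false-positive
    and false-negative counts; an undefined ratio is reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta = 1 gives the balanced F1 score; beta = 2 gives the F2 score,
    which rewards recall more strongly than precision.
    """
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0  # precision and recall are both zero
    return (1 + beta_sq) * precision * recall / denominator


if __name__ == "__main__":
    # Hypothetical counts for a detector, for illustration only.
    p, r = precision_recall(tp=80, fp=20, fn=10)
    print(f"precision={p:.3f} recall={r:.3f}")
    print(f"F1={f_beta(p, r, beta=1.0):.3f} F2={f_beta(p, r, beta=2.0):.3f}")
```

When recall exceeds precision, as in the hypothetical counts above, F2 is higher than F1; the reverse holds when precision exceeds recall.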
Exploitation Route The additional metagenomics annotations in Europe PMC allow researchers to quickly identify metadata in full text articles. These metadata fields are often missing from the submitted sequence records, yet can be essential for understanding confounding experimental factors when two different studies are compared. Such mark-up will increase the reuse of generated datasets, and enrich the MGnify and BioSample databases. This methodology for marking up literature could be extended to other types of papers. The integration of the metagenomics annotation pipeline into Europe PMC and MGnify will also enable computational biologists to conduct accurate meta-analyses on a wide range of longitudinal and cross-sectional metagenomics studies to unveil the role of microbial communities in environmental phenomena as well as health and disease.

The BGC annotation pipeline identified novel BGCs from literature on pollutant-degrading enzymes and on novel compounds with antitumor, antiviral and antibacterial activities, which can benefit researchers tackling environmental pollutants or working on novel drugs for untreatable diseases.

The SMBGC predictions by EMERALD BGC are providing a unique view on the repertoire encoded by prokaryotes. While these predictions require experimental validation, they may give rise to a broader spectrum of novel antimicrobial compounds that may be applied in a variety of settings, from health care to food preservation. Because the gene clusters are self-contained cassettes required for metabolite production, and there is an increasing trend for these to be applied to metagenomics, where the source organism may never have been isolated, the increased accuracy of the boundaries that our tool provides is already generating interest from the broader community. Having the cassette more accurately defined means the de novo synthesis of the cassette is more efficient. As part of this work, we have generated additional profile HMMs to represent some of the unannotated genes in the SMBGCs, which have been added to Pfam, the protein families database. These additional entries will be used for genome annotation and will help others trying to annotate SMBGCs.

Additional outcomes are likely, but the award is still active.
Sectors Agriculture, Food and Drink, Digital/Communication/Information Technologies (including Software), Environment, Healthcare, Pharmaceuticals and Medical Biotechnology

Title Novel tool for identifying biosynthetic gene clusters (BGCs) from literature 
Description A new machine learning framework has been developed for enriching the MIBiG database with novel biosynthetic gene clusters (BGCs) from literature. A total of 7 deep learning models have been trained to recognize 7 BGC entities in Europe PMC publications. The framework is publicly available for users on GitLab.
Type Of Material Improvements to research infrastructure 
Year Produced 2022 
Provided To Others? Yes  
Impact A set of articles for the MIBiG database curators to add to their database.
URL https://gitlab.com/maaly7/emerald_bgcs_annotations
Title Literature training datasets 
Description Curated metagenomics and BGC datasets have been released at https://gitlab.com/maaly7/emerald_metagenomics_annotations (140 publications) and https://gitlab.com/maaly7/emerald_bgcs_annotations (150 publications), respectively.
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
Impact Benchmarking datasets on which to train machine learning methods.
URL https://gitlab.com/maaly7/emerald_metagenomics_annotations
Description MIBiG interaction
Organisation Wageningen University & Research
Country Netherlands 
Sector Academic/University 
PI Contribution A collaboration was established with the MIBiG database (https://mibig.secondarymetabolites.org/) team to validate and deposit a total of 1489 new biosynthetic gene clusters (BGCs) that were generated by a new machine learning framework developed for enriching BGC metadata.
Collaborator Contribution The MIBiG database has been used to provide training data for our software tools.
Impact These publications will be used by annotators for the
Start Year 2022
Title Development of profile HMM library for the detection of secondary metabolites. 
Description We have been identifying proteins found in secondary metabolite gene clusters that are missing annotation by Pfam. We have clustered the proteins lacking an annotation to identify missing protein families. These have then been developed into protein profile hidden Markov models, which have been added to the Pfam database. We have also identified cases where Pfam domains were missing matches and improved existing models.
Type Of Technology New/Improved Technique/Technology 
Year Produced 2020 
Impact This has increased the coverage of proteins that are found in secondary metabolite gene clusters, enabling better training of machine learning algorithms to enable the distinction of clusters compared to other coding regions of bacteria. This remains a work in progress, but we have significantly increased the coverage of bacteriocins. While this data product 
URL http://pfam.xfam.org
Title EmeraldBGC 
Description https://github.com/Finn-Lab/emeraldBGC 
Type Of Technology Software 
Year Produced 2022 
Open Source License? Yes  
Impact EmeraldBGC is a novel machine learning tool for detecting and classifying secondary metabolite biosynthetic gene clusters (BGC) in bacterial genomes and metagenomes. 
Title Machine Learning framework 
Description This new ML framework includes: 1) literature classification and triage, 2) definition and curation of novel metagenomics entities, 3) training of named entity recognition (NER) models (BERT) and NER prediction, and 4) database enrichment.
Type Of Technology New/Improved Technique/Technology 
Year Produced 2020 
Impact This new ML framework makes it easier to extract data pertinent to a wide range of metagenomics studies from the Europe PMC literature repository. 
Title Microbiome classifier and Named-Entity Recognition 
Description Using literature-based machine learning (ML) approaches, we have been identifying key metagenomics papers, as well as their related biomes, experimental factors, and secondary metabolite gene clusters (SMGC). For identifying key metagenomics papers, we have developed supervised biome classifiers that classify publications linked to ENA metagenomics studies into host-associated, environmental and engineered metagenomics studies, as well as marine and human faecal microbiomes. We generated the training dataset by linking curated metagenomics samples in MGnify with their corresponding publications and trained several Random Forest models to predict diverse microbiomes. For recognising metagenomics metadata, we have identified bags of words representing other biome types, experimental factors and secondary metabolite gene clusters, and used their contextual representations (word embeddings), generated by training unsupervised neural networks, to identify further biome-related metadata.
Type Of Technology New/Improved Technique/Technology 
Year Produced 2019 
Impact Classifying and identifying metagenomics publications that cover a wide variety of microbiomes helped in creating a representative set of metagenomics papers for triage, biocuration and the subsequent training and refining of machine learning models.
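As a rough illustration of the bag-of-words idea mentioned in the description above, the sketch below represents a text as word counts and assigns it to the biome whose seed vocabulary it most resembles by cosine similarity. This is deliberately much simpler than the Random Forest classifiers and word embeddings used in the project and is not the project's implementation; the seed vocabularies, biome labels and function names are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count each alphabetic token."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[token] * b[token] for token in set(a) & set(b))
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical seed vocabularies, one bag of words per biome label.
BIOME_SEEDS = {
    "marine": bag_of_words("seawater ocean marine sediment coastal"),
    "human faecal": bag_of_words("stool faecal gut human intestinal"),
}


def closest_biome(text):
    """Return the biome label whose seed vocabulary best matches the text."""
    words = bag_of_words(text)
    return max(
        BIOME_SEEDS,
        key=lambda biome: cosine_similarity(words, BIOME_SEEDS[biome]),
    )


if __name__ == "__main__":
    print(closest_biome("Stool samples were collected from the human gut."))
```

In practice a trained classifier would replace the hand-written seed lists, and a text sharing no words with any seed vocabulary would need to be handled explicitly rather than assigned to the first label.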
Description Proceedings presentation and poster at ISMB 2021
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact "A machine learning framework for discovering and enriching metagenomics metadata from open access research articles" was accepted as a proceedings presentation and poster at ISMB 2021 (https://www.iscb.org/cms_addon/conferences/ismbeccb2021/tracks/textmining) and is currently available on ISCBtv (https://www.youtube.com/c/ISCBtv/featured).
Year(s) Of Engagement Activity 2021
URL https://www.iscb.org/cms_addon/conferences/ismbeccb2021/tracks/textmining