EMERALD - Enriching MEtagenomics Results using Artificial intelligence and Literature Data

Lead Research Organisation: European Bioinformatics Institute

Department Name: Genome Assembly and Annotation

Abstract

Microbes like bacteria and fungi inhabit diverse environments, including soil, water, and human body sites, such as the mouth, skin and intestine. Ubiquitous in nature, they also show adaptation to extreme environments, such as acid mine drainage or hydrothermal vents. We have appreciated the potential of microbes for a long time - they are important for food and beverage manufacturing (e.g. cheese and beer), and are key players in bioremediation, as demonstrated by their pivotal role in breaking down complex oils following the Deepwater Horizon oil spill in the Gulf of Mexico. The field of metagenomics offers an exciting opportunity to examine these microbial communities and gain insights into various aspects of their existence, e.g. their interaction with humans and plants, their potential as disease reservoirs, and their potential as sources of novel enzymes with bioremediation or plastic recycling abilities.

Metagenomics studies microbial communities by sampling the environments directly, extracting and sequencing their genetic material (DNA), and applying computational methods to elucidate microbial composition and function. This sampling approach helps to characterise microbes that are unculturable, or as yet uncultured, in the laboratory. Metagenomics experimental data are typically large (10-100s of GBs per sequencing run; 100s of runs per project), complex (comprising 100-1000s of different microbes) and variable due to the nature of the underlying experiments and (sub-)sampling of the dynamic populations.

Despite knowledge about fluxes within a microbial community (e.g. time of year or day), metagenomic datasets typically contain poor descriptions (termed metadata) relating to the sample origin or methods used to obtain the DNA and process the sequence data. To help interpret data across experiments and derive meaningful biological conclusions, it is crucial to know whether a difference between two metagenomics datasets is due to differences in underlying experimental techniques or the biological qualities of the sample. The lack of metadata has impeded our attempts to apply machine learning (ML) techniques to interpret new incoming data, and therefore our capacity to find novel biological applications.

To circumvent these issues, our proposal aims to employ different ML methodologies to enrich the currently available metadata and start elucidating new knowledge embedded in the sequence data. The text mining approach will focus on identifying research articles on metagenomics experiments to unearth and extract detailed descriptions, which will be used to enrich the metadata associated with the corresponding DNA sequences and generate new or improved classification systems. This dictionary of descriptor terms will also serve as the template for developing methods to discover previously unidentified metagenomics papers. We will train algorithms on this enriched metadata to progressively learn what criteria might be applied to incoming data with inadequate descriptions in order to determine sample origin and processing, as well as to decipher which experimental biases affect the results when comparing similar samples.

ML approaches will also be used for the discovery of new biological functions. Bacteria encode gene cassettes that are responsible for producing compounds of pharmaceutical and agricultural value. Functional descriptions for the genes constituting these cassettes are incomplete, while many cassettes still await discovery. By combining the ML and text mining approaches, we intend to better describe these cassettes and also focus on the detection of novel groups.

Data underpinning this work will originate from key EMBL-EBI databases, namely EBI Metagenomics and Europe PMC, as well as other resources (e.g. MIBiG). The developments proposed here will help resolve complexities underlying experimental data, enriching the metadata in the process and also laying the foundation for a new generation of reliable predictive models.

Technical Summary

The field of metagenomics is burgeoning as the technique furnishes insights into the sum total of all microbial content within particular biomes. Technological advances in sequencing methods have resulted in a data deluge - while this has afforded us access to hitherto rare microbes, the analysis is often complicated due to inconsistencies in data sampling, lack of metadata specificity, data variability for identical biomes and choice of downstream analysis tools. As datasets from metagenomics experiments are inherently noisy, detecting significant and explicit biological signals becomes challenging. Comparison across similar datasets would help detect meaningful signals, but the paucity of standardized contextual metadata, associated literature and granularity in labelling makes this difficult.

To overcome these issues, we will apply text mining (NLP) and machine learning (ML) methodologies to enrich and standardize metadata, improve functional annotations, and enhance discovery of novel secondary metabolite gene clusters (SMGC). We will identify metadata-linked terms already present in the EBI Metagenomics portal (EMG) and in full text publications in Europe PMC to develop training sets that will facilitate NLP/ML approaches for finding additional metadata. We will apply ML algorithms based on metagenomics datasets to determine biome-specific tags, enrich metadata and identify outlying datasets. Using a combination of EMG-linked data, biological relationships, and literature, we will also develop ML models that incorporate the complex rules behind cluster evolution and metabolite production. We will enhance existing SMGC descriptions via NLP approaches and use them to develop training sets for the detection of SMGCs. These ML SMGC models will then be applied to assembled metagenomics contigs to find novel SMGCs. We will also investigate clustering tools to enhance our ability to discover novel clusters based on the EMG protein sequence database.

Planned Impact

Metagenomics is a rapidly expanding field wherein the depth and breadth of data are constantly increasing. Consequently, the number of published research articles associated with the field is also growing. However, there is often a disconnect between sample, sequence data and publication. The lack of data integration has hampered the production of statistically robust, predictive models. Moreover, datasets from different groups are rarely compared, partly because experimental approaches for investigating different microbiomes are constantly evolving.

In this proposal, we plan to use machine learning (ML) algorithms and natural language processing (NLP) to help overcome these challenges by improving metadata and developing predictions based on taxonomic and functional assignments contained within EBI Metagenomics (EMG), enhanced by linking to the primary literature in Europe PMC. We will also focus on the use of both ML and NLP to enhance our ability to discover novel microbial secondary metabolite gene clusters (SMGCs) in our metagenomics assemblies. SMGCs are responsible for the production of key products, like antimicrobials and insecticides, both of great agricultural and biotechnological importance, as well as impacting human health.

Due to the widespread use of metagenomics and the position of EMG and Europe PMC, we anticipate the impact of this research to be significant. Metagenomics is widespread in research projects associated with BBSRC strategic priorities: agriculture and food security, industrial biotechnology and bioscience for health; the field represents the epitome of data driven biology. Through the application of NLP and ML, we will demonstrate how these new technologies can be utilised to help research scientists interrogate big data. Whilst we will be domain focused, the technical developments within this project will have far reaching impacts, applicable to other fields and analytical disciplines. The 'use cases' in the program will cover a range of cross-cutting themes, demonstrating the general applicability of the techniques to different environments and conditions. Furthermore, the semantically marked up literature, enriched metadata and SMGC annotations will have applications in a wide range of academic and industrial fields, including enzyme discovery, environmental science, diagnostics and animal/human health.

We will ensure impact on all academic and industrial audiences by the publication of software, data, compute containers and peer reviewed articles. To address the skills shortages in the fields of metagenomics, NLP and ML, we will deliver training, webinars and participate in community workshops. Other dissemination routes include the use of networks and collaborations, conferences and social media channels. The public sector will also be engaged, via specific events and through the publication of non-specialist articles and interviews.

The outputs of the project will be of exceptional value to the commercial sector, and the benefits will eventually feed through to the public. The software and the applications thereof will lead to new discoveries such as new antibiotics for humans and livestock, higher agricultural yields from the understanding of socio-ecological interplay (e.g. food chain microbes) and expanded discovery of novel enzymes capable of operating at extremes, such as psychrophilic enzymes for detergents, or with novel catalytic functionality (e.g. anaerobic digestion pathways in biofuel production).

Combining literature and metagenomic data as in this proposal is pivotal to the notion of One Health: the collaborative effort of multiple disciplines working at national and international levels to attain optimal health for people, animals and the environment. Our proposal encapsulates this philosophy and will impact major UK and international communities, ensuring that the potential of metagenomics data is collectively realised.

Funded Value:

£606,286

Funded Period:

Apr 19 - Aug 22

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/S009043/1

Principal Investigator:

Robert Finn

Research Subject:

Info. & commun. Technol. (45%)

Omic sciences & technologies (18%)

Tools, technologies & methods (36%)

Research Topic:

Artificial Intelligence (18%)

Environmental Informatics (18%)

Environmental biotechnology (18%)

Genomics (18%)

Information & Knowledge Mgmt (27%)

Organisations

People	ORCID iD
Robert Finn (Principal Investigator)
Johanna McEntyre (Co-Investigator)

Publications

Nassar M (2022) A machine learning framework for discovery and enrichment of metagenomics metadata from open access publications. in GigaScience


Key Findings
Policy Influence
Research Databases and Models
Research Tools and Methods
Collaboration
Software and Technical Products
Engagement Activities


Description	We have developed a machine learning (ML) framework to identify sample and experimental metadata in publications that describe metagenomics research. These metadata annotations have been made available via the Europe PMC Annotations platform and can be accessed both in Europe PMC (article view) and the Annotations API. These metadata annotations encompass 16 novel entity types, such as host, ecoregion, engineered (environment), body site, place, kit, primer, sequencing (platform), state, and treatment. During the funding period, this framework enabled the extraction of metadata from 114,099 publications in Europe PMC, including 19,900 publications describing metagenomics studies in ENA and MGnify. The metadata annotations are also pulled into MGnify and displayed alongside user submitted metadata. Such information is critical for understanding differences that occur between different metagenomics datasets, which could be caused by different experimental approaches and/or different states (e.g. healthy vs diseased state). We have extended the MGnify website to dynamically pull this data so that it can be displayed next to the relevant analysis in the website. When the user drills down to the metadata terms of interest, we enable the sentence that contains the metadata term to be viewed, providing the context of the term in the research article. This helps distinguish useful metadata terms from those that may be more generic in a broader context, such as those which may be found in an introduction or discussion. In an additional activity, we have also mined literature to identify papers describing secondary metabolite biosynthetic gene clusters (BGC). Current databases that catalogue BGCs are incomplete and rely on curators identifying papers. 
Our literature mining approach has streamlined this activity, identifying the key pieces of information about the genome producing the natural product, the natural product itself, and the activity/mode of action of the natural product. We have shared these articles with the producers of the MIBiG resource, which collects and curates experimentally verified BGCs to provide a rich dataset that is commonly used for ML tool training. These data have also been used by computational groups to validate BGC prediction tools. In a parallel thread of work, we have developed a novel method for detecting and classifying secondary metabolite BGCs in bacterial genomes. This new method is composed of three modules: (1) a pre-processing module that functionally annotates proteins using InterProScan, which increases the total number of annotated proteins compared to other detection methods; (2) a detection module, which consists of an artificial neural network trained with a time-series loss function that helps to overcome the class imbalance in BGC datasets; (3) a post-processing module that filters out low-probability detected regions and classifies the remainder into one of the seven categories used in the MIBiG database. The method has been implemented as a Python package called SanntiS, which is publicly available at https://github.com/Finn-Lab/sanntis and can be installed via bioconda and/or using a Docker container. Comparing performance to other methods (antiSMASH, DeepBGC, and GECCO) on two datasets independent of the training dataset (a real genomic dataset and a synthetic dataset) revealed that SanntiS achieved the highest F2 score (i.e. it has a high recall without losing precision) in both datasets (0.77 and 0.87), and the highest F1 score (best overall performance) in the real genomic dataset (0.76). Evaluation on a synthetic metagenomic assembly showed that SanntiS has the best performance (F1=0.82 and F2=0.79). 
We used SanntiS to predict BGCs in a set of 24,408 MGnify metagenomic assemblies spanning different biomes, which resulted in the identification of 1,144,466 putative BGCs from different classes. These results therefore offer a unique resource of BGCs from uncultured microorganisms. We have also developed a mechanism that allows the incorporation of SanntiS predictions into the contig visualisation tool available from the MGnify website. We are in the process of finalising this entire work into a manuscript for publication. Finally, we have developed an ML-based tool to aid the prediction of biomes based on taxonomic or functional content. At the time of the grant submission, MGnify labelled the biome at the level of the study because, historically, samples connected to a study would be from the same environmental source or biome, e.g. human gut. However, as the size of studies has increased, there are increasing numbers of cases where a study contains many different biomes, for example human-associated samples and samples from the built environment. To address this limitation, the aim of this ML tool is to predict the biome of each sample based on the analysis results produced by MGnify. To do so, we have trained the tool on MGnify datasets and used multiple rounds of cross-validation to identify those instances where the taxonomic and/or functional classifications are highly accurate. For instance, the classifier works extremely well with animal host-associated biomes, with many studies being more specifically labelled. This tool will be applied to all samples in MGnify and results will be displayed on the website, with indicators of the confidence of the classification. We have started to examine in detail those predicted biomes that differ substantially from their originally assigned labels. There can be ambiguities for certain samples, especially for engineered environments, where composition can often show similarities to other environments (e.g. wastewater and human gut microbiomes). Others have indicated cases where a sample is clearly mislabelled, heavily contaminated, or represents a negative control. Providing these additional annotations will facilitate a finer grained biome classification in MGnify.
Exploitation Route	Metadata fields are often missing from the submitted sequence records but are essential for understanding both the biological context and confounding experimental factors when two different studies are compared. The additional metagenomics annotations in Europe PMC now allow researchers to quickly identify metadata in full text articles. Such mark-ups will increase the reuse of generated datasets and further enrich the MGnify and BioSample databases. This methodology for marking up literature could be extended to other types of papers. The integration of the metagenomics annotations pipeline into Europe PMC, whose annotations are subsequently pulled into MGnify, will also enable computational biologists to conduct accurate meta-analyses on a wide range of longitudinal and cross-sectional metagenomics studies to unveil the role of microbial communities in environmental phenomena as well as health and disease. The BGC annotations pipeline identified novel BGCs from literature on pollutant-degrading enzymes and novel compounds with antitumour, antiviral and antibacterial activities, which can benefit researchers tackling environmental pollutants or working on novel drugs for untreatable diseases, respectively. The BGC predictions by our new tool SanntiS provide a unique view of the repertoire encoded by prokaryotes. While these predictions require experimental validation, they nevertheless represent the first step in providing access to a broader untapped spectrum of novel antimicrobial compounds that may be applied in a variety of settings ranging from health care to food preservation. These gene clusters are self-contained cassettes required for metabolite production, and since there is an increasing trend for these to be identified in metagenomics analyses where the source organism may have never been isolated, the increased boundary accuracy provided by SanntiS has already generated interest from the broader community. 
Having the cassette accurately defined facilitates efficient de novo synthesis of the cassette for subsequent analysis. As part of this work, we have generated additional profile HMMs to represent some of the unannotated genes in the BGCs and added them to Pfam, the protein families database. These additional entries will be used for genome annotation and help others trying to annotate BGCs. Clustering of the BGCs also allows the identification of the core components as well as the accessory genes of a BGC, the latter of which are likely to give rise to subtly different natural products. Developing a deeper understanding of the gene diversity connected to BGCs allows for the rational design of new BGCs in silico, which may give rise to a broader range of natural products. The additional biome annotations will improve the data quality contained in MGnify. Examples of applications include users wishing to find similar datasets for inclusion in a meta-analysis or those wishing to simply discover analysis results according to environments.
Sectors	Agriculture, Food and Drink,Digital/Communication/Information Technologies (including Software),Environment,Healthcare,Pharmaceuticals and Medical Biotechnology
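
The F1 and F2 scores quoted in the key findings are both instances of the general F-beta score, which combines precision and recall and gives recall more weight as beta grows. The sketch below shows the standard formula only; the precision and recall values are made-up illustrations and are not SanntiS results.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score: (1 + b^2) * P * R / (b^2 * P + R)."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        # Convention: the score is 0 when both precision and recall are 0.
        return 0.0
    return (1 + b2) * precision * recall / denominator


# Illustrative values only (hypothetical, not taken from the project).
precision, recall = 0.5, 1.0
f1 = f_beta(precision, recall, beta=1.0)  # balanced harmonic mean
f2 = f_beta(precision, recall, beta=2.0)  # recall counts more than precision
print(f"F1={f1:.3f} F2={f2:.3f}")
```

With high recall and modest precision, F2 comes out higher than F1, which is why F2 is a natural choice when missing a true gene cluster is considered worse than reporting a spurious one.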


Description	Member, UKRI Knowledge Transfer Network (KTN) Microbiome Innovation Network
Geographic Reach	National
Policy Influence Type	Participation in a guidance/advisory committee


Title	Novel tool for identifying biosynthetic gene clusters (BGCs) from literature
Description	A new machine learning framework has been developed for enriching the MIBiG database with novel biosynthetic gene clusters (BGCs) from literature. A total of 7 deep learning models have been trained to recognize 7 BGC entity types in Europe PMC publications. The framework is publicly available on GitLab.
Type Of Material	Improvements to research infrastructure
Year Produced	2022
Provided To Others?	Yes
Impact	A set of articles for the MIBiG database curators to add to their database.
URL	https://gitlab.com/maaly7/emerald_bgcs_annotations


Title	SanntiS
Description	We have developed a new biosynthetic gene cluster (BGC) prediction tool called SanntiS.
Type Of Material	Improvements to research infrastructure
Year Produced	2022
Provided To Others?	Yes
Impact	Running SanntiS against all of the MGnify assemblies has generated over 1.2M novel predictions, corresponding to ~24,000 distinct gene clusters.
URL	https://github.com/Finn-Lab/SanntiS
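
The key findings describe a post-processing step that filters out low-probability detected regions. The following is only a minimal sketch of what such a filter could look like, assuming the detector emits one probability per gene along a contig; it is not SanntiS's actual code, and the function name, the two thresholds and the scores are all hypothetical.

```python
from typing import List, Tuple


def call_regions(
    gene_probs: List[float],
    gene_threshold: float = 0.5,   # hypothetical per-gene cut-off
    region_cutoff: float = 0.75,   # hypothetical per-region cut-off
) -> List[Tuple[int, int, float]]:
    """Group consecutive genes scoring >= gene_threshold into regions and
    keep only regions whose mean probability is >= region_cutoff.

    Returns (start_index, end_index_exclusive, mean_probability) tuples.
    """
    regions = []
    start = None
    # A sentinel below any threshold closes a run that reaches the end.
    for i, p in enumerate(list(gene_probs) + [float("-inf")]):
        if p >= gene_threshold:
            if start is None:
                start = i
        elif start is not None:
            run = gene_probs[start:i]
            mean_p = sum(run) / len(run)
            if mean_p >= region_cutoff:
                regions.append((start, i, mean_p))
            start = None
    return regions


# Made-up per-gene scores for one contig.
scores = [0.1, 0.9, 0.8, 0.7, 0.2, 0.6, 0.55, 0.1, 0.95, 0.9]
regions = call_regions(scores)
for start, end, mean_p in regions:
    print(f"genes {start}..{end - 1}: mean probability {mean_p:.3f}")
```

In this toy input the middle run (genes 5 and 6) clears the per-gene threshold but is discarded because its mean probability falls below the region cut-off.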


Title	Literature training datasets
Description	Curated metagenomics and BGC datasets have been released at https://gitlab.com/maaly7/emerald_metagenomics_annotations (140 publications) and https://gitlab.com/maaly7/emerald_bgcs_annotations (150 publications), respectively.
Type Of Material	Database/Collection of data
Year Produced	2022
Provided To Others?	Yes
Impact	Benchmarking datasets for training machine learning methods.
URL	https://gitlab.com/maaly7/emerald_metagenomics_annotations


Title	Supporting data for "A machine learning framework for discovery and enrichment of metagenomics metadata from open access publications"
Description	Metagenomics is a culture-independent method for studying the microbes inhabiting a particular environment. Comparing the composition of samples (functionally/taxonomically), either from a longitudinal study or cross-sectional studies can provide clues into how the microbiota has adapted to the environment. However, a recurring challenge, especially when comparing results between independent studies, is that key metadata about the sample and molecular methods used to extract and sequence the genetic material are often missing from sequence records, making it difficult to account for confounding factors. Nevertheless, this missing metadata may be found in the narrative of publications describing the research. Here, we describe a machine learning framework that automatically extracts essential metadata for a wide range of metagenomics studies from the literature contained in Europe PMC. This framework has enabled the extraction of metadata from 114,099 publications in Europe PMC, including 19,900 publications describing metagenomics studies in ENA and MGnify.
Type Of Material	Database/Collection of data
Year Produced	2022
Provided To Others?	Yes
URL	http://gigadb.org/dataset/102235


Description	MIBiG interaction
Organisation	Wageningen University & Research
Country	Netherlands
Sector	Academic/University
PI Contribution	A collaboration was established with the MIBiG database (https://mibig.secondarymetabolites.org/) team to validate and deposit a total of 1489 new biosynthetic gene clusters (BGCs) that were generated by a new machine learning framework developed for enriching BGC metadata.
Collaborator Contribution	The MIBiG database has been used to provide training data for our software tools.
Impact	These publications will be used by annotators for the
Start Year	2022


Title	Development of profile HMM library for the detection of secondary metabolites.
Description	We have been identifying proteins found in secondary metabolite gene clusters that are missing annotation by Pfam. For those proteins lacking an annotation, we have clustered these proteins to identify missing protein families. These have then been developed into protein profile hidden Markov models, which have been added to the Pfam database. We have also identified cases where Pfam domains were missing matches and improved existing models.
Type Of Technology	New/Improved Technique/Technology
Year Produced	2020
Impact	This has increased the coverage of proteins that are found in secondary metabolite gene clusters, enabling better training of machine learning algorithms to distinguish clusters from other coding regions of bacteria. This remains a work in progress, but we have significantly increased the coverage of bacteriocins. While this data product
URL	http://pfam.xfam.org


Title	EmeraldBGC
Description	https://github.com/Finn-Lab/emeraldBGC
Type Of Technology	Software
Year Produced	2022
Open Source License?	Yes
Impact	EmeraldBGC is a novel machine learning tool for detecting and classifying secondary metabolite biosynthetic gene clusters (BGC) in bacterial genomes and metagenomes.


Title	Machine Learning framework
Description	This new ML framework includes: 1) literature classification and triage, 2) definition and curation of novel metagenomics entities, 3) training of named entity recognition (NER) models (BERT) and NER prediction, and 4) database enrichment.
Type Of Technology	New/Improved Technique/Technology
Year Produced	2020
Impact	This new ML framework makes it easier to extract data pertinent to a wide range of metagenomics studies from the Europe PMC literature repository.
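
To illustrate the kind of output the framework produces (an entity type, the matched term, and the sentence that gives it context), here is a minimal sketch. It deliberately swaps the trained BERT NER models for a plain dictionary lookup, so it shows the shape of the annotations rather than the project's method; the term lists and function names are hypothetical, while the entity type names are among those listed in the key findings.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical mini-dictionary keyed by entity type.
TERMS: Dict[str, List[str]] = {
    "body site": ["gut", "skin"],
    "sequencing": ["Illumina MiSeq"],
    "kit": ["PowerSoil"],
}


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tag_entities(text: str, terms: Dict[str, List[str]]) -> List[Tuple[str, str, str]]:
    """Return (entity_type, term, sentence) for every dictionary hit."""
    hits = []
    for sentence in split_sentences(text):
        for entity_type, names in terms.items():
            for name in names:
                pattern = r"\b" + re.escape(name) + r"\b"
                if re.search(pattern, sentence, flags=re.IGNORECASE):
                    hits.append((entity_type, name, sentence))
    return hits


text = (
    "The gut microbiome has been widely studied. "
    "DNA was extracted with the PowerSoil kit and sequenced on an Illumina MiSeq."
)
annotations = tag_entities(text, TERMS)
for entity_type, term, sentence in annotations:
    print(f"[{entity_type}] {term!r} in: {sentence}")
```

Keeping the sentence alongside each hit is what lets a reader tell a methods-section mention (the second sentence) from a generic background mention (the first).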


Title	Microbiome classifier and Named-Entity Recognition
Description	Using literature-based machine learning (ML) approaches, we have been identifying key metagenomics papers, as well as their related biomes, experimental factors, and secondary metabolite gene clusters (SMGC). For identifying key metagenomics papers, we have developed supervised biome classifiers that classified publications linked to ENA metagenomics studies into host-associated, environmental and engineered metagenomics studies, as well as marine and human faecal microbiomes. We generated the training dataset by linking curated metagenomics samples in MGnify with their corresponding publications and trained several Random Forest models to predict diverse microbiomes. For recognising metagenomics metadata, we have identified bags of words representing other biome types, experimental factors and secondary metabolite gene clusters, and used their contextual representations (word embeddings), which were generated by unsupervised neural network training, to identify further biome-related metadata.
Type Of Technology	New/Improved Technique/Technology
Year Produced	2019
Impact	Classifying and identifying metagenomics publications that cover a wide variety of microbiomes helped in creating a representative set of metagenomics triage papers for biocuration and subsequent training and refining of machine learning models.
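
One common way to use word embeddings to find further related terms from a seed term is nearest-neighbour search by cosine similarity. The sketch below illustrates that general idea only; the three-dimensional vectors, the vocabulary, the similarity threshold and the function names are all made up for the example and are not the project's embeddings or code.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy embeddings (hypothetical values; real ones would have many dimensions).
EMBEDDINGS: Dict[str, List[float]] = {
    "seawater": [0.9, 0.1, 0.0],
    "marine": [0.8, 0.2, 0.1],
    "faeces": [0.1, 0.9, 0.1],
    "stool": [0.0, 0.8, 0.2],
    "primer": [0.1, 0.1, 0.9],
}


def expand_terms(
    seed: str, embeddings: Dict[str, List[float]], min_similarity: float = 0.9
) -> List[Tuple[str, float]]:
    """Return terms whose embedding is close to the seed term, best first."""
    seed_vec = embeddings[seed]
    scored = [
        (term, cosine(seed_vec, vec))
        for term, vec in embeddings.items()
        if term != seed
    ]
    kept = [(term, score) for term, score in scored if score >= min_similarity]
    return sorted(kept, key=lambda pair: -pair[1])


related = expand_terms("marine", EMBEDDINGS)
print(related)
```

Candidate terms found this way would still need curation before being added to a dictionary, since proximity in embedding space does not guarantee that a term belongs to the same entity type.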


Description	"What metagenomic data can tell us about healing the planet" talk at the Life Science Across the Globe - talks on science and culture
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Policymakers/politicians
Results and Impact	Talk by PI Rob Finn on MGnify at the Learning from the planet to heal the planet: Microbial Ecosystems online seminar series (hosted by EMBL and HHMI Janelia Research Campus).
Year(s) Of Engagement Activity	2022
URL	https://www.youtube.com/watch?v=Hc89Rrs_ykY&ab_channel=HHMI%27sJaneliaResearchCampus


Description	BIOPROSP_23 Keynote talk "Genome Resolved Metagenomics - Understanding the potential of marine microbial communities for novel product discovery"
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Keynote talk by PI Rob Finn at the BIOPROSP-23 conference held at Tromsø, Norway. BIOPROSP is the international biennial scientific conference on marine biotechnology, which aims to translate basic research into applied research with industrial application.
Year(s) Of Engagement Activity	2023
URL	https://www.tekna.no/en/events/bioprosp_23-42323/Program/?info=156913


Description	ETIM 2022 talk "Genome resolved metagenomics: understanding the metabolic potential of microbial communities"
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Talk by MGnify PI Rob Finn at the ETIM 2022 meeting on Artificial Intelligence and Bioinformatics, held in Essen.
Year(s) Of Engagement Activity	2022
URL	https://etim.uk-essen.de


Description	ISME 18 Roundtable "What does it take to be FAIR?" by the National Microbiome Data Collaborative
Form Of Engagement Activity	A formal working group, expert panel or dialogue
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Roundtable organised by the National Microbiome Data Collaborative at ISME18. PI Rob Finn was an expert panelist on the roundtable. Discussions covered attitude shifts required for microbiome data sharing, what constitutes good metadata and other points.
Year(s) Of Engagement Activity	2022
URL	https://twitter.com/MicrobiomeData/status/1559210668485640194


Description	Proceedings presentation and poster at ISMB2021
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Postgraduate students
Results and Impact	"A machine learning framework for discovering and enriching metagenomics metadata from open access research articles" was accepted as a proceedings presentation and poster at ISMB2021 (https://www.iscb.org/cms_addon/conferences/ismbeccb2021/tracks/textmining) and is currently available on ISCBtv (https://www.youtube.com/c/ISCBtv/featured).
Year(s) Of Engagement Activity	2021
URL	https://www.iscb.org/cms_addon/conferences/ismbeccb2021/tracks/textmining
