Integrating new statistical frameworks into eDNA survey and analysis at the landscape scale

Lead Research Organisation: University of Kent

Department Name: Sch of Maths Statistics & Actuarial Sci

Abstract

In recent years, three major innovations have occurred in ecology: (1) the emergence of new statistical methods for analysing community data; (2) the rapid detection of species and whole communities from environmental DNA (eDNA) and bulk-sample DNA; and (3) the wide availability of remotely sensed environmental covariates. The efficiency gains are such that hundreds or even thousands of species can now be detected and, to an extent, quantified in hundreds or even thousands of samples. Collectively, these three innovations have the potential to relieve the problems of data limitation and analysis that environmental management has been struggling with, opening the way to near-real-time tracking of state and change in biodiversity and its functions and services over whole landscapes.

The aim of our project is to develop an integrated statistical framework for DNA-based surveys of biodiversity. The framework will allow the estimation of community compositions and the identification of the landscape characteristics that drive them. We will develop a Bayesian hierarchical model accounting for the probabilistic nature of DNA-based data due to observation error and taxonomic uncertainty and for model uncertainty due to the unknown strength and direction of landscape effects on the system. We will build sophisticated and efficient algorithms within a Bayesian framework for identifying the important landscape covariates that predict community structure and provide guidelines on optimal allocation of resources in DNA-based surveys for achieving the required power to infer species distributions and to link them to landscape covariates.
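As a rough illustration of the resource-allocation question raised above, the sketch below compares ways of splitting a fixed budget between field samples and PCR replicates per sample. It is not the project's method: it ignores false positives, and the function names, cost parameters and the symbols `theta11` (probability that a sample from an occupied site contains target DNA) and `p11` (probability that a PCR on such a sample is positive) are assumptions made for this example.

```python
def detection_probability(n_samples, n_pcrs, theta11, p11):
    """P(at least one positive PCR at an occupied site), ignoring false positives."""
    # A sample yields a detection if it captured DNA and at least one PCR amplified it.
    per_sample = theta11 * (1 - (1 - p11) ** n_pcrs)
    return 1 - (1 - per_sample) ** n_samples


def best_allocation(budget, cost_sample, cost_pcr, theta11, p11, max_pcrs=12):
    """Search PCR counts per sample; return (probability, samples, PCRs) or None."""
    best = None
    for n_pcrs in range(1, max_pcrs + 1):
        # Each sample costs its collection plus its PCR replicates.
        n_samples = int(budget // (cost_sample + n_pcrs * cost_pcr))
        if n_samples < 1:
            continue
        candidate = (
            detection_probability(n_samples, n_pcrs, theta11, p11),
            n_samples,
            n_pcrs,
        )
        if best is None or candidate > best:
            best = candidate
    return best


# Illustrative values only, not estimates from any survey.
print(best_allocation(budget=100, cost_sample=10, cost_pcr=2, theta11=0.6, p11=0.7))
```

A fuller treatment would also account for false positives at both stages and for uncertainty in the parameters themselves, which is what a hierarchical model of the kind described above is intended to provide.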

The huge potential contribution of DNA-based data to landscape decision-making is demonstrated by how Natural England, Local Planning Authorities (LPAs), and the NatureSpace Partnership use eDNA to create a biodiversity-offset market ('District Licensing') for the protected Great Crested Newt (GCN). Water samples from 500 ponds across the South Midlands (spanning ~3320 sq km) were tested for GCN and used to create a distribution map, which was then zoned into four 'impact risk' levels. Builders pay a known, sliding-scale fee, and a portion of the fee is used to build and manage new habitat. District Licensing is only feasible with eDNA's greater efficiency. GCN District Licensing expands to at least 16 LPAs in 2020, aiming to go nationwide, which would make it the largest biodiversity-focused, land-use decision scheme in the UK, if not the world.

The natural, and highly desirable, extension to the GCN scheme would be to map 'all biodiversity' and to make land-use decisions (e.g. impact risk maps, offset markets, habitat creation) on this broader basis. In fact, samples originally collected for GCN can be repurposed for this larger goal by using 'metabarcoding', meaning that the eDNA is PCR-amplified for a larger range of taxa. Given the District-Licensing expansion plans, pond eDNA metabarcoding alone could provide an efficient way to map biodiversity across much of the UK.

This is far from the only such programme. Ecologists in industry and academia around the world are plunging ahead with large-scale DNA-sampling campaigns, and there is, as yet, no comprehensive set of statistical methods for modelling the individual steps of the new observation processes, quantifying the resulting uncertainty, and assessing how it affects decision-making at the landscape level. Our proposed modelling framework will provide such tools by explicitly capturing measurement bias within biodiversity models as a set of observation processes, and not merely as error. Improving sampling designs and workflows as a result of our proposed models will profoundly increase the efficiency and credibility of inference and therefore reduce the risk of biodiversity loss during the political process of allocating land to different uses.

Planned Impact

Research outputs for the beneficiaries

The outputs from the research will comprise (1) new statistical models for assessing uncertainties in the different stages of eDNA workflow, from sampling through to taxonomic assignment and inference of presence and absence of species; (2) user-friendly software in R-Shiny for research users to estimate these uncertainties and the landscape-level factors that influence them; (3) tools for optimising the design and delivery of landscape-level DNA-based surveys; (4) two Knowledge Exchange workshops and training events for research users to deliver (1)-(3).

Who could potentially benefit from these research outputs over different timescales?

Immediate beneficiaries (i.e. will benefit during the course of the research and immediately after): These are organisations that the research team is currently working with. They are providing the source data for the project and have an interest in using the project outputs (see support letters). They include government agencies (e.g. Natural England), non-government agencies (e.g. Freshwater Habitats Trust, Amphibian and Reptile Conservation), environmental consultants (e.g. ARCESL) and private-sector service providers (e.g. NatureMetrics, NatureSpace). Members of the research team serve on the boards/advisory panels for all of these organisations.
Medium to long-term beneficiaries (i.e. will benefit within 5 years of the research concluding): These include the wider community of research users operating at the landscape scale, such as private-sector service providers, planning authorities, ecological consultants and non-governmental organisations.

How might the potential beneficiaries benefit?

Immediate beneficiaries: Through board and advisory panel membership, there will be an immediate route for dissemination, implementation and review of the research via these organisations. Immediate benefits will include: improved understanding of the uncertainties involved at different stages of the workflow and where resources need targeting to address these; improved understanding of the landscape-level variables that influence uncertainty; access to free scripts and R-Shiny software to improve the design and analysis of DNA-based surveys at the landscape level; and opportunities to attend training workshops where the project outputs will be demonstrated and disseminated. A more detailed description of the mechanisms by which these will be achieved is in the Pathways to Impact attachment.

Medium to long-term beneficiaries: The mechanisms by which these beneficiaries will be reached are twofold. Firstly, there will be direct dissemination of the findings from the research via ongoing research-user workshops being run by members of the research team. For example, in September 2019, EM, RG and ASB ran a workshop at the University of Kent for their existing network of external research users. The workshop demonstrated new software (seak.shinyapps.io/eDNA), developed at the University of Kent and implementing novel statistical methods, developed by JG and EM, that for the first time quantify uncertainty in single-species eDNA sampling and analysis at the landscape level. These workshops will continue at no cost to the grant through separate impact funding and ASB's joint postdoc position between Kent and the Amphibian and Reptile Conservation Trust (ARC). Secondly, there will be 'trickle down' benefits through dissemination via the networks of the organisations working directly with the research team and/or attending the workshops. Collectively, the combination of the research team working directly with research users from different sectors and this onward dissemination through their networks will ensure the rapid adoption of the research outcomes. Ultimately, this will result in much more cost-effective and reliable DNA sampling protocols with clear economic and societal benefits.

Funded Value:

£303,199

Funded Period:

Feb 20 - Sep 22

Funder:

UKRI

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

NE/T010045/1

Principal Investigator:

Eleni Matechou

Research Subject:

Agri-environmental science (20%)

Ecol, biodivers. & systematics (50%)

Mathematical sciences (30%)

Research Topic:

Community Ecology (20%)

Conservation Ecology (20%)

Earth & environmental (20%)

Population Ecology (10%)

Statistics & Appl. Probability (30%)

Organisations

People	ORCID iD
Eleni Matechou (Principal Investigator)	http://orcid.org/0000-0003-3626-844X
Jim Griffin (Co-Investigator)	http://orcid.org/0000-0002-4828-7368
Alex Bush (Co-Investigator)
Richard Griffiths (Co-Investigator)
Douglas Yu (Co-Investigator)

Publications

Buxton A (2022) Reliability of environmental DNA surveys to detect pond occupancy by newts at a national scale, in Scientific Reports

Buxton A (2021) Optimising sampling and analysis protocols in environmental DNA studies, in Scientific Reports

Diana A (2023) Fast Bayesian Inference for Large Occupancy Datasets, in Biometrics

Diana A (2021) An RShiny app for modelling environmental DNA data: accounting for false positive and false negative observation error, in Ecography

Key Findings
Impact Summary
Further Funding
Research Databases and Models
Research Tools and Methods
Software and Technical Products
Engagement Activities


Description	We have already published three papers, have submitted one more and are currently drafting another. Our work so far has challenged current practices for interpreting data from DNA-based surveys by demonstrating the non-negligible probabilities of false positive and false negative error both when the environmental samples (e.g. soil or water) are collected and when they are analysed in the lab. We have also contested the current approach, which does not require repeat samples or repeat lab analyses, and demonstrated the considerable improvement in inference when effort, and hence cost, is allocated between the sample collection stage and the lab stage differently from what is currently standard practice in DNA-based surveys. Our new models and corresponding R package are already being used by practitioners around the world. We have now developed a multi-species model for DNA-based surveys and have submitted the corresponding paper to JASA, one of the top statistics journals in the world. We have given a workshop at the International Statistical Ecology Conference in Cape Town in June, and we have been awarded a Knowledge Transfer Partnership in collaboration with NatureMetrics, who are interested in incorporating our new models within their bioinformatics pipeline. Finally, we have applied for NERC's Pushing the Frontiers of Environmental Science Research funding to explore the applications of our new models to studying soil microbiomes, lotic system communities and benthic communities.
Exploitation Route	We have developed freely available R code for all our new models, and we have given a number of workshops, both in the UK and abroad, to help with dissemination. We have been approached by several researchers and other organisations who are interested in using our new models, and we are now collaborating with researchers in the UK and in Italy.
Sectors	Agriculture, Food and Drink; Environment
URL	https://blogs.kent.ac.uk/edna/
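
To illustrate the two-stage error structure described in the findings above (errors at sample collection and again in the lab), here is a minimal sketch of how repeat samples and repeat PCRs can be combined into a posterior probability that a site is occupied. It is a simplified, fixed-parameter calculation and not the project's software; the symbols `psi` (prior occupancy probability), `theta11` and `theta10` (probability that a sample contains target DNA at an occupied or unoccupied site) and `p11` and `p10` (per-PCR positive rate with or without DNA in the sample) are assumptions chosen for this example.

```python
from math import comb


def binom_pmf(x, k, p):
    """Probability of x positives among k independent replicates with rate p."""
    return comb(k, x) * p**x * (1 - p) ** (k - x)


def sample_likelihood(x, k, theta, p11, p10):
    """Likelihood of x positive PCRs out of k for one sample.

    theta is the probability that the sample contains target DNA, so the
    likelihood mixes the 'DNA present' and 'DNA absent' PCR outcomes.
    """
    return theta * binom_pmf(x, k, p11) + (1 - theta) * binom_pmf(x, k, p10)


def posterior_presence(pcr_positives, k, psi, theta11, theta10, p11, p10):
    """Posterior probability that the site is occupied, given positives per sample."""
    like_present = 1.0
    like_absent = 1.0
    for x in pcr_positives:
        like_present *= sample_likelihood(x, k, theta11, p11, p10)
        like_absent *= sample_likelihood(x, k, theta10, p11, p10)
    numerator = psi * like_present
    return numerator / (numerator + (1 - psi) * like_absent)


# Illustrative values only, not estimates from the project.
params = dict(psi=0.3, theta11=0.8, theta10=0.05, p11=0.9, p10=0.02)
weak = posterior_presence([1], k=12, **params)     # 1 of 12 PCRs positive
strong = posterior_presence([10], k=12, **params)  # 10 of 12 PCRs positive
print(weak, strong)
```

Under these illustrative values, a single positive PCR out of twelve leaves the posterior below the prior, because a lone positive is better explained by a lab false positive than by DNA that amplified only once. In practice the error rates are unknown and would be estimated jointly with occupancy rather than fixed as they are here.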


Description	We have been awarded a Knowledge Transfer Partnership with NatureMetrics who are interested in incorporating our new modelling framework within their bioinformatics pipeline, essentially changing their product so that it also includes output from our model. As part of this collaboration, we have developed new models and tools for study design and we have worked with organisations, such as WWF, to help them design their eDNA studies.
First Year Of Impact	2022
Sector	Environment
Impact Types	Societal; Economic


Description	University of Kent and NatureMetrics KTP 21_22R5
Amount	£99,607 (GBP)
Organisation	Innovate UK
Sector	Public
Country	United Kingdom
Start	08/2022
End	09/2023


Title	eDNAPlus R package
Description	We have developed a new R package and associated workshop material for fitting the models developed as part of this project: https://blogs.kent.ac.uk/edna/metabarcoding-data/download/
Type Of Material	Improvements to research infrastructure
Year Produced	2022
Provided To Others?	Yes
Impact	We have had 80+ participants in our workshops and a number of those researchers are now using our R package for analysing their own data.
URL	https://blogs.kent.ac.uk/edna/metabarcoding-data/download/


Title	Malaise-trap metabarcoding dataset from temperate-zone forest Oregon, USA
Description	DNA-based biodiversity surveys involve collecting physical samples from survey sites and assaying the contents in the laboratory to detect species via their diagnostic DNA sequences. DNA-based surveys are increasingly being adopted for biodiversity monitoring and decision-making. The most commonly employed method is metabarcoding, which combines PCR with high-throughput DNA sequencing to amplify and then read 'DNA barcode' sequences. This process generates count data indicating the number of times each DNA barcode was read. However, DNA-based data are noisy and error-prone, with several sources of variation. In this paper, we present a unifying modelling framework for DNA-based survey data, eDNAPlus, for the first time simultaneously allowing for key sources of variation, error and noise in the data-generating process. As we discuss, metabarcoding data alone cannot be used to estimate the species-specific amount of DNA present, or DNA concentration, at surveyed sites. Instead, we estimate changes in DNA biomass within species, across sites, and link those changes to environmental covariates, while accounting for between-species and between-site correlation. Inference is performed using MCMC, where we employ Gibbs or Metropolis-Hastings updates with Laplace approximations. We further implement a re-parameterisation scheme, appropriate for crossed-effects models, leading to improved mixing, and an adaptive approach for updating latent variables, which reduces computation time. We discuss study design and present theoretical and simulation results to guide decisions on replication at different survey stages and on the use of quality control methods. Finally, we demonstrate the new framework on a dataset of Malaise-trap samples.
Specifically, we quantify the effects of elevation and distance-to-road on each species, infer species correlations, and produce maps identifying areas of high biodiversity and species DNA biomass, which can be used to rank areas by conservation value. We also estimate the level of noise between sites and within sample replicates, and the probabilities of error at the PCR stage, which are found to be close to zero for most species considered, validating the employed laboratory processing.
Type Of Material	Database/Collection of data
Year Produced	2023
Provided To Others?	Yes
URL	https://datadryad.org/stash/dataset/doi:10.5061/dryad.4f4qrfjjb


Title	eDNAPlus
Description	DNA-based biodiversity surveys involve collecting physical samples from survey sites and assaying the contents in the laboratory to detect species via their diagnostic DNA sequences. DNA-based surveys are increasingly being adopted for biodiversity monitoring. The most commonly employed method is metabarcoding, which combines PCR with high-throughput DNA sequencing to amplify and then read 'DNA barcode' sequences. This process generates count data indicating the number of times each DNA barcode was read. However, DNA-based data are noisy and error-prone, with several sources of variation. In this paper, we present a unifying modelling framework for DNA-based data allowing for all key sources of variation and error in the data-generating process. The model can estimate within-species biomass changes across sites and link those changes to environmental covariates, while accounting for between-species and between-site correlation. Inference is performed using MCMC, where we employ Gibbs or Metropolis-Hastings updates with Laplace approximations. We also implement a re-parameterisation scheme, appropriate for crossed-effects models, leading to improved mixing, and an adaptive approach for updating latent variables, reducing computation time. We discuss study design and present theoretical and simulation results to guide decisions on replication at different stages and on the use of quality control methods. We demonstrate the new framework on a dataset of Malaise-trap samples. We quantify the effects of elevation and distance-to-road on each species, infer species correlations, and produce maps identifying areas of high biodiversity, which can be used to rank areas by conservation value. We estimate the level of noise between sites and within sample replicates, and the probabilities of error at the PCR stage, which are close to zero for most species considered, validating the employed laboratory processing.
Type Of Material	Data analysis technique
Year Produced	2022
Provided To Others?	Yes
Impact	Several researchers have been using this new modelling framework to analyse their data and to design their data collection approach.
URL	https://arxiv.org/abs/2211.12213


Title	H.J. Andrews Malaise-trap metabarcoding dataset, Session 1
Description	These are the scripts to prepare the input data files for the eDNAPlus software (https://github.com/alexdiana1992/eDNAplus). Data file origin: We collected 121 Malaise-trap samples from 89 sample sites in and around the HJ Andrews Experimental Forest, Oregon. Each sample was subjected to the Begum metabarcoding pipeline (described in Yang, C.Y., Bohmann, K., Wang, X.Y., Wang, C., Wales, N., Ding, Z.L., Gopalakrishnan, S., Yu, D.W. (2021) Biodiversity Soup II: A bulk-sample metabarcoding pipeline emphasizing error reduction. Methods in Ecology and Evolution 12:1252-1264. doi: 10.1111/2041-210X.13602). In short, each sample was DNA-extracted and then PCR-amplified for a 313 base-pair fragment of the COI DNA-barcode gene (using the Leray-FolDegenRev primer pair, described in Yang et al. 2021). In the Begum pipeline, each sample is independently PCR-amplified three times and then library-prepped and sequenced on an Illumina sequencer (amplicon sequencing). Finally, we processed the Illumina output files to trim low-quality sequences, merge read pairs, and assign the originating sample name to each read. Unlike in the standard Begum pipeline, we did not use the 3 separate PCRs per sample to detect and filter out erroneous sequences. Instead, we accepted all reads (i.e. we filtered at no stringency, accepting reads if they appeared in ≥1 PCR with ≥1 read). The resulting fasta-format read dataset is the starting dataset for the custom scripts in this archive. These custom scripts generate 3 separate OTU tables (sample x species tables) from the 3 separate PCRs. The 3 OTU data tables are the raw inputs to eDNAPlus. The scripts and data files are also uploaded to datadryad.org as a single zip file.
Type Of Technology	Software
Year Produced	2023
Open Source License?	Yes
URL	https://zenodo.org/record/8220862
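
The read-acceptance rule in the description above (keep a sequence if it appears in at least one PCR with at least one read) can be sketched as a threshold filter over per-PCR read counts. This is an illustration only and not the archived scripts or the Begum software; the function names, the dictionary layout and the read counts are hypothetical.

```python
def filter_otus(read_counts, min_pcrs=1, min_reads=1):
    """Keep OTUs seen in at least `min_pcrs` PCRs with at least `min_reads` reads each.

    read_counts maps an OTU name to its read counts across the PCR
    replicates of a single sample.
    """
    kept = {}
    for otu, counts in read_counts.items():
        pcrs_passing = sum(1 for c in counts if c >= min_reads)
        if pcrs_passing >= min_pcrs:
            kept[otu] = counts
    return kept


def split_by_pcr(read_counts, n_pcrs=3):
    """Build one {otu: reads} table per PCR replicate."""
    return [
        {otu: counts[i] for otu, counts in read_counts.items()}
        for i in range(n_pcrs)
    ]


# Hypothetical read counts for one sample: OTU -> reads in PCR 1, 2 and 3.
sample = {"otu_a": [120, 98, 143], "otu_b": [0, 3, 0], "otu_c": [0, 0, 0]}

lenient = filter_otus(sample, min_pcrs=1, min_reads=1)  # no-stringency rule
strict = filter_otus(sample, min_pcrs=2, min_reads=2)   # a stricter example threshold
tables = split_by_pcr(lenient)                          # one table per PCR

print(sorted(lenient), sorted(strict), tables)
```

With the lenient thresholds, a low-abundance sequence such as `otu_b` is kept and passed on, leaving the downstream model to decide whether it is real or an error, whereas the stricter thresholds would discard it before modelling.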


Description	Workshops
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Postgraduate students
Results and Impact	We have delivered multiple workshops on modelling eDNA data, both in the UK and abroad. We have recorded the sessions, which are still being watched by researchers who want to learn about our new models and use our R packages for their data. https://blogs.kent.ac.uk/edna/workshops/
Year(s) Of Engagement Activity	2021,2022
URL	https://blogs.kent.ac.uk/edna/workshops/
