Data integration for large scale ecological models

Lead Research Organisation: NERC Centre for Ecology and Hydrology
Department Name: Biodiversity (Wallingford)


Ecological models are becoming larger and more complicated, and are being used for an increasingly wide range of applications, from describing trends and mapping distributions to understanding mechanistic relationships and predicting the impact of future scenarios. In response, there has been a huge growth in statistical methods for large-scale ecological models. However, most such methods do not account for the fact that ecological data are inherently heterogeneous, and large datasets typically contain many forms of bias.

Recently, a set of hierarchical Bayesian models (HBMs) has emerged as a promising way of dealing with biased data, particularly for occurrence records and other unstructured data. Many millions of unstructured occurrence records exist, so the potential of these new methods is enormous.

Not all data contain biases, though. A minority of biodiversity data is highly structured in terms of the sample locations, fixed protocols and regular sampling. Ideally, we'd like to retain the information that this structure provides in our models, but combine it with the much larger sample sizes of unstructured datasets.

Integrated models provide a way to do this. They are a subclass of HBM in which data heterogeneity is modelled explicitly, by treating datasets with different observation processes as independent realisations of the same underlying state. For example, casual observations on GBIF and the Breeding Bird Survey both contain information about whether the population of a particular species was extant at a particular point in space and time.
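The idea of a shared underlying state can be illustrated with a minimal sketch. The code below is not the project's framework: it assumes a simple site-occupancy model in which a structured survey and an unstructured dataset each have their own per-visit detection probability but depend on the same occupancy state. All names (psi, p_s, p_u, site_likelihood, joint_log_likelihood) and the example counts are hypothetical, and false positives are assumed not to occur.

```python
import math


def site_likelihood(psi, p_s, p_u, y_s, n_s, y_u, n_u):
    """Marginal likelihood of one site's detections under a shared occupancy state.

    psi      : probability that the site is occupied (the shared underlying state)
    p_s, p_u : per-visit detection probabilities (structured, unstructured)
    y_s, y_u : number of visits with a detection in each dataset
    n_s, n_u : number of visits in each dataset
    Binomial coefficients are omitted because they do not depend on the parameters.
    """
    # If the site is occupied, the two datasets are independent given that state.
    occupied = (psi
                * p_s ** y_s * (1 - p_s) ** (n_s - y_s)
                * p_u ** y_u * (1 - p_u) ** (n_u - y_u))
    # An unoccupied site can only yield all-zero records (no false positives assumed).
    unoccupied = (1 - psi) if (y_s == 0 and y_u == 0) else 0.0
    return occupied + unoccupied


def joint_log_likelihood(psi, p_s, p_u, sites):
    """Sum of per-site log-likelihoods; `sites` holds (y_s, n_s, y_u, n_u) tuples."""
    return sum(math.log(site_likelihood(psi, p_s, p_u, *site)) for site in sites)


# Hypothetical data: three sites, each with 3 structured and 5 unstructured visits.
sites = [(2, 3, 0, 5), (0, 3, 1, 5), (0, 3, 0, 5)]
print(joint_log_likelihood(0.5, 0.6, 0.1, sites))
```

Both datasets inform the single parameter psi, while each keeps its own detection parameter. This is the sense in which the observation processes are modelled separately but the state is shared.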

At present, these integrated models are the preserve of highly competent statisticians. They are hard to specify and difficult to fit and diagnose. One goal of this partnership is to build an extensible framework for fitting integrated models that will make them accessible to a broad community of ecological modellers. This framework, in the form of open source tools, will make it easier for ecologists to handle biased data when addressing large-scale questions about biodiversity.

Although integrated models are attractive from a conceptual standpoint, it is unclear whether their sophistication delivers real benefits over simpler approaches. In particular, there is an urgent need for some general principles about how to proceed when both structured and unstructured data sources are available. Critical questions include:
Q1. When and how should we combine datasets with different properties?
Q2. Under what circumstances is simple aggregation (i.e. ignoring the different observation processes) better than integration?
Q3. If we suspect the data contain biases, can we detect them and handle them adequately?
Q4. What are the most appropriate metrics for information content and model fit?
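One way to make Q2 concrete is a small simulation, sketched below under strong simplifying assumptions that are not taken from the project: a single occupancy probability, a fixed number of visits per site, no false positives, and a coarse grid search in place of a proper Bayesian fit. The "pooled" fit forces both datasets to share one detection probability (simple aggregation), while the "integrated" fit lets them differ. All names and parameter values are hypothetical.

```python
import math
import random
from collections import Counter

VISITS_S, VISITS_U = 3, 3               # visits per site in each dataset (illustrative)
GRID = [i / 20 for i in range(1, 20)]   # candidate parameter values 0.05 .. 0.95


def simulate(n_sites=200, psi=0.6, p_s=0.8, p_u=0.2, seed=1):
    """Simulate two datasets that observe the same latent occupancy state."""
    rng = random.Random(seed)
    histories = Counter()
    for _ in range(n_sites):
        occupied = rng.random() < psi   # shared underlying state
        y_s = sum(occupied and rng.random() < p_s for _ in range(VISITS_S))
        y_u = sum(occupied and rng.random() < p_u for _ in range(VISITS_U))
        histories[(y_s, y_u)] += 1      # tally sites by detection counts
    return histories


def log_lik(histories, psi, p_s, p_u):
    """Joint log-likelihood, marginalising over the occupancy state of each site."""
    total = 0.0
    for (y_s, y_u), n in histories.items():
        occupied = (psi
                    * p_s ** y_s * (1 - p_s) ** (VISITS_S - y_s)
                    * p_u ** y_u * (1 - p_u) ** (VISITS_U - y_u))
        empty = (1 - psi) if (y_s == 0 and y_u == 0) else 0.0
        total += n * math.log(occupied + empty)
    return total


def fit(histories, pooled):
    """Grid-search maximum likelihood; pooled=True forces p_s == p_u."""
    best = None
    for psi in GRID:
        for p_s in GRID:
            for p_u in ([p_s] if pooled else GRID):
                ll = log_lik(histories, psi, p_s, p_u)
                if best is None or ll > best[0]:
                    best = (ll, psi, p_s, p_u)
    return best


histories = simulate()
pooled = fit(histories, pooled=True)
integrated = fit(histories, pooled=False)
print("pooled     (loglik, psi, p_s, p_u):", pooled)
print("integrated (loglik, psi, p_s, p_u):", integrated)
```

Because the pooled model is a constrained case of the integrated one, the integrated fit can never have a lower maximised likelihood. Whether the extra parameter is worth its cost is the kind of question that Q2 and Q4 raise, and this sketch does not answer it.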

These general questions lie at the intersection of the research interests of PI Isaac, Co-I Henrys and Project Partner O'Hara. Each has made some progress towards addressing specific aspects of these questions. Working in partnership would add significant value to each, by taking existing research beyond the specific context and toward general answers to these big questions. It would permit a co-ordinated effort and build a work program of international significance.

This pump-priming award would provide a platform for this partnership. The overall aim is to build a framework for inference in large-scale models of species' distribution, and to test it using computer simulations.

Planned Impact

Three types of non-academic activity will benefit from the research described in this proposal.

1. Design of biodiversity monitoring programs
Large-scale structured biodiversity monitoring is expensive: ensuring cost-effective survey design is a major priority for the agencies that commission such research, particularly in the current climate of government austerity. In the UK, the principal agencies include JNCC, Defra, Natural England and Scottish Natural Heritage. These agencies are attracted to the potential of citizen science and opportunistic recording to generate large volumes of data at relatively low cost. However, the value of these data types is questionable, due to the lack of structure in how data are collected. Integrated modelling provides a way to combine these unstructured data with more traditional structured surveys. To some extent this has already happened: the new Defra-funded Pollinator Monitoring Scheme has been designed with elements of structured and unstructured data observation processes, using a mixture of professional surveyors and citizen scientists. The 'rules of thumb' arising from this project will make it possible to design cost-effective biodiversity surveillance schemes in which stratified random sampling using formal protocols by professional scientists can be augmented by large-scale observations by citizen scientists (and vice versa). In this way, the outcomes of this research project will support decisions about future scheme designs, both in the UK and internationally.

2. Reporting on biodiversity targets
Biodiversity indicators are a key tool for reporting against national targets and international treaty obligations (including the "Aichi targets"). Currently, the UK has eleven biodiversity indicators that report on the status of species, of which nine use data from structured surveys and two use unstructured occurrence records (both new indicators were developed by the project team). For many species there are multiple datasets available, but there is no obvious way to combine them. By providing clear guidance on data integration, our research will ensure that biodiversity indicators make the best use of available data and prevent arbitrary choices being made about which data to use. This will be a welcome development for agencies with responsibility for delivering biodiversity indicators, both in the UK (JNCC and Defra) and internationally.

3. International Networks
The biodiversity crisis is a global phenomenon, and biodiversity itself does not respect international borders. For this reason, there is a need to coordinate responses to the biodiversity crisis across nations, in which IPBES and GEO-BON have an important role to play. Part of this role involves synthesising large quantities of information about biodiversity from different countries. Data integration, and the development of models that facilitate such integration, is necessary to form a coherent narrative at the global scale. The development of Essential Biodiversity Variables (EBVs), led by GEO-BON, is one way in which synthesis can be achieved. PI Isaac recently contributed to the design of a roadmap for building EBVs for species' distribution and abundance at the global scale, in which integrated models (to account for data heterogeneity) were highlighted as a key knowledge gap.


Description Integrated modelling of species distributions and abundance is emerging as a powerful tool in statistical ecology, and is expected to underpin the next generation of models predicting the current, future and potential distributions of species. Point processes provide a flexible framework for developing integrated models, combining data representing the locations of individual organisms, local population abundance and species-site occupancy. In this project, we developed methods that provide opportunities to make best use of existing and new data sources. We assessed the value of data integration over conventional approaches, and evaluated, using simulations, the situations when data integration is likely to be beneficial.
Exploitation Route Integrated models are currently the preserve of statisticians. Our work makes these developments accessible to a broad set of non-specialists. Research conducted in this project underpins a series of integrated modelling frameworks and new schemes for monitoring the state of the environment.
Sectors Environment

Description Stakeholders in government agencies and NGOs, both in the UK and internationally, are using the insights from this work to develop new wildlife monitoring schemes and data storage architectures.
First Year Of Impact 2020
Sector Environment
Impact Types Policy & public services

Description Integrated modelling framework adopted by EU pollinator monitoring scheme
Geographic Reach Europe 
Policy Influence Type Participation in an advisory committee
Description ARIES DTP studentship
Amount £75,000 (GBP)
Organisation Natural Environment Research Council 
Sector Public
Country United Kingdom
Start 10/2020 
End 03/2024
Description GLobal Insect Threat-Response Synthesis (GLiTRS): a comprehensive and predictive assessment of the pattern and consequences of insect declines
Amount £902,701 (GBP)
Funding ID NE/V007548/1 
Organisation Natural Environment Research Council 
Sector Public
Country United Kingdom
Start 11/2020 
End 11/2024
Description Pollinator Monitoring Scheme 
Organisation Department For Environment, Food And Rural Affairs (DEFRA)
Country United Kingdom 
Sector Public 
PI Contribution The pollinator monitoring scheme is a new program for monitoring the status of pollinating insects across the UK. My research team is responsible for developing the statistical modelling framework for reporting trends for each species. We have based this framework on insights gained from the "Data integration" project: there is a systematic survey and an extensive unstructured recording scheme, both of which generate valuable information.
Collaborator Contribution Taxonomic expertise; Survey design; coordination; access to networks of volunteer citizen scientists; communications
Impact Design of a pan-European monitoring scheme built on integration of multiple evidence streams; cost-benefit analysis showing that pollinator monitoring more than pays for itself.
Start Year 2018
Title Integrated analysis of black-throated blue warbler data from PA, USA 
Description This product is R code to run an integrated distribution model in R-INLA, using the principles developed under this project and which are described in our paper in Trends in Ecology & Evolution. This worked example contains all the steps required to download the data, fit the model and display the outputs. 
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact The code developed here is a case study in the paper we published in early 2020 (see "publications") 
Description Presented at Living Norway symposium 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The Living Norway seminar ran over two days in June 2019 as the launch event of the Living Norway Ecological Data Network. The network spans academics, government agencies and volunteer groups in Norway. The meeting was also attended by data holders from other Scandinavian countries. In my talk, I explained how the principles of data integration work in an ecological context, showing how well-designed models and databases make it possible to bring multiple data types to bear on questions about how biodiversity is distributed.
Year(s) Of Engagement Activity 2019