Imperfect data: accuracy, impacts and extraction of meaningful information

Lead Research Organisation: University of Nottingham


Meaningful information is a fundamental requirement for informed, logical and reasoned activity. Extracting meaningful information from data can, however, be a challenge, especially because data may, amongst other things, be inaccurate, incomplete and possibly contradictory when they arise from a variety of sources of variable quality and trust level.

Data imperfections are a generic problem in information extraction and decision making and so the work is relevant in many disciplines. Imperfect data are, for example, evident in medical diagnosis (e.g. a patient's test results are typically only an imperfect indicator of a condition), in defining nature reserves for species conservation (e.g. the species distribution maps and models are often highly sensitive to 'absence' data - was the species actually present but not observed?) and in security and defence applications (e.g. sub-pixel target detection algorithms applied to surveillance imagery vary in performance and utility between environments). Some problems with imperfect data were highly apparent in the response to the Haiti earthquake of 2010, especially in relation to damage mapping to inform relief activities. Vast amounts of well-intentioned assistance were provided by numerous professional and amateur bodies at unprecedented data rates, but the volumes of data and the problems with them were a concern. Key problems were that maps were inaccurate, inconsistent and sometimes contradictory. As such, a major mapping challenge arises in how to work with such data. One key issue is the need for information on the accuracy of data sources and for methods that help make use of imperfect data. This project seeks to contribute to this task. It aims to illustrate the impacts of using imperfect data, and to explore methods to characterise the quality of the data and methods to combine data sources to yield an enhanced product of known accuracy.

A range of methods will be used but the core focus is on the use of latent class modelling. This type of analysis is based on multiple observations or data from a variety of sources. The relationships between the observers/data sources are used to attempt to explain their quality and suggest how the data could be interpreted to yield information. The approach is a form of statistical modelling and is highly attractive for the specific research proposal because, if a model can be formed that fits the observed data, then the model's parameters define the accuracy of the data sources and its outputs can be used to form new products of known accuracy. As such, the modelling analysis may add value to data by indicating their quality and combining them usefully for the extraction of information.
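The idea can be illustrated with a minimal sketch. It is not the project's own implementation: it assumes the simplest case of a two-class latent variable (e.g. damaged / not damaged), binary labels from several sources, and errors that are conditionally independent given the true class, and it fits the model by expectation-maximisation. The function names, the simulated data and the accuracy values are all hypothetical and chosen only for illustration.

```python
import numpy as np


def simulate_sources(n_cases, prevalence, sens, spec, rng):
    """Simulate binary labels from imperfect sources (all values hypothetical)."""
    sens = np.asarray(sens)
    spec = np.asarray(spec)
    truth = rng.random(n_cases) < prevalence  # hidden true class
    draws = rng.random((n_cases, sens.size))
    # A source reports 1 with probability sens if the truth is 1,
    # and with probability 1 - spec if the truth is 0.
    labels = np.where(truth[:, None], draws < sens, draws >= spec)
    return truth, labels.astype(float)


def fit_latent_class(labels, n_iter=200):
    """Fit a two-class latent class model by EM, without reference data.

    labels: (n_cases, n_sources) array of 0/1 values.
    Assumes source errors are independent given the true class.
    Returns prevalence, per-source sensitivity, per-source specificity
    and the posterior probability that each case belongs to class 1.
    """
    labels = np.asarray(labels, dtype=float)
    eps = 1e-6
    # Initialise from the share of sources voting 1, so class 1 means "present".
    post = np.clip(labels.mean(axis=1), 0.05, 0.95)
    for _ in range(n_iter):
        # M-step: update parameters from the current class memberships.
        prev = np.clip(post.mean(), eps, 1 - eps)
        sens = np.clip(post @ labels / post.sum(), eps, 1 - eps)
        spec = np.clip((1 - post) @ (1 - labels) / (1 - post).sum(), eps, 1 - eps)
        # E-step: update class memberships from the current parameters.
        log_present = (np.log(prev) + labels @ np.log(sens)
                       + (1 - labels) @ np.log(1 - sens))
        log_absent = (np.log(1 - prev) + labels @ np.log(1 - spec)
                      + (1 - labels) @ np.log(spec))
        post = 1.0 / (1.0 + np.exp(log_absent - log_present))
    return prev, sens, spec, post


# Hypothetical example: five sources of differing quality label 5000 cases.
rng = np.random.default_rng(0)
true_sens = [0.90, 0.80, 0.85, 0.70, 0.95]
true_spec = [0.85, 0.90, 0.80, 0.75, 0.90]
truth, labels = simulate_sources(5000, 0.3, true_sens, true_spec, rng)

prev, sens, spec, post = fit_latent_class(labels)
combined = post > 0.5  # combined product derived from all sources

print("estimated prevalence:", round(float(prev), 3))
print("estimated sensitivities:", np.round(sens, 2))
print("estimated specificities:", np.round(spec, 2))
print("agreement of combined product with hidden truth:",
      round(float((combined == truth).mean()), 3))
```

In this sketch the hidden truth is used only to generate the data and to check the result afterwards; the fitting step sees nothing but the sources' labels. The estimated sensitivities and specificities play the role of the source accuracies, and the posterior probabilities provide the combined product. With binary labels and two classes, at least three sources are needed for the parameters to be identifiable, and the conditional independence assumption may not hold in real data.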

As the problems of imperfect data are generic, the proposal has broad potential impacts. For the specific DaISy call there are clear impacts in relation to security and defence. For example, methods that enable rapid and qualified information to be derived from sources of variable accuracy, completeness and trust level will increase effectiveness and the quality of decision making. Additionally, as a model-based approach, it removes or reduces the need to acquire reference data for validation, which could otherwise require the deployment of personnel to dangerous locations, and so is of considerable benefit to health and well-being.

Planned Impact

Many should benefit from the proposed research given the generic nature of the topic. There is a direct pathway to impact via the links with a key end-user of the research, Dstl. This pathway benefits from the reporting requirements, which form natural milestones at which to disseminate current results and to inform new analyses that tailor the project to Dstl's specific needs as it progresses.

The project responds to specific parts of the DaISy call, with work focused especially on concerns with the use of imperfect data (e.g. problems of inaccuracy, incompleteness, bias, duplication etc.). The work will impact on defence activities in a number of ways as the research is, essentially, exploring means to add value to data - notably by providing information on their accuracy and by combining data sets of unknown accuracy and trust level from sources of varying authority into a new product whose accuracy is also defined.

The project will also illustrate how imperfect data may be used to provide information on one or more unobserved (and potentially unobservable) variables. For example, data on observable variables (e.g. dress, gait, cleanliness etc.) may be indicators of behavioural traits that could be of value to security applications such as helping to identify and prioritise individuals for detailed assessment at passport control at border crossings. Thus, the methods could aid the targeting of resources and help enhance the safety of the UK and its citizens.

Similarly, by revealing an unobserved variable the approach may be of value in revealing the characteristics of individuals, helping to ensure that each is deployed in a suitable role, or in identifying latent characteristics that define future training needs and inform development programmes. This could greatly increase the efficiency and effectiveness with which policies are enacted.

There are numerous potential impacts beyond the defence and security sector. These include impacts that will:
- Increase the effectiveness of activities linked to major national and international policies. For example, imperfect deforestation data currently limit major international policy related programmes such as UN REDD+.
- Increase the engagement of the public with research by involving them in data acquisition. Enhancing engagement with the public may also help to enhance their quality of life.
- Increase the potential for enhanced economic competitiveness, as a move from design-based to model-based inference would allow data validation to be undertaken less expensively.
- Enhance health and well-being, as the adoption of model-based validation removes or reduces the need to collect data for validation, which may reduce the need to deploy personnel in potentially dangerous environments.
- Enhance human health and well-being by supporting disaster relief activities with rapid and accurate information for planning and executing response actions.

Although the proposal is exploratory, it is expected to lead to results that merit wide dissemination. Thus, subject to approval from Dstl, it is proposed to disseminate the results through normal academic channels. It is anticipated that this would involve at least two major conference presentations as well as journal articles. Three papers are currently anticipated and journal selection will be based, in part, on ensuring wide dissemination, which will be enhanced through the use of open access outlets.

Finally, as it is intended to derive some data directly from citizen sensors, it is also intended to provide each contributor with feedback on their performance relative to others (anonymously to conform to confidentiality and ethical concerns). This should enhance the volunteer's role in the project as well as their appreciation of the research and its implications. Critically it should help engage the public more directly in the work and lead to improved future performance.
Description The key finding is that, in some circumstances, the quality of imperfect data may be estimated from the data set itself. Also, very good estimates of properties of interest can be obtained from imperfect data. Critically, the data themselves indicate their quality, so there is no need for expensive and difficult-to-collect reference data.
Exploitation Route The project helps define how the quality of imperfect data sets can be assessed and how such data sets may be put to good use. The focus has been on crowdsourcing and is being taken up in a new project to be funded by EU Horizon (LandSense - led by IIASA, Austria). Through the latter the work may now impact in a variety of subject areas.
Sectors Agriculture, Food and Drink, Digital/Communication/Information Technologies (including Software), Environment, Other

Description The findings have informed the development of good practices for use in other crowdsourcing projects. Impacts to date have been small because the number of contributors has been low, but efforts are under way to increase it so that the latent class model may be used more fully.
First Year Of Impact 2018
Sector Environment
Impact Types Societal

Description LandSense 
Organisation International Institute for Applied Systems Analysis
Country Austria 
Sector Academic/University 
PI Contribution Builds on work funded by EPSRC plus related work, e.g. COST Action. Critically, it includes work on using imperfect crowdsourced data.
Collaborator Contribution IIASA lead the consortium. Please note the award has not technically been made and is still at the negotiation stage; what is indicated is approximate.
Impact Have co-authored papers. New project; direct input started in 2016 as leader of WP5. An RA started in Feb 2017 and is focused on quality issues connected to imperfect data.
Start Year 2013