Missing Data as Useful Data
Lead Research Organisation:
University of Glasgow
Department Name: School of Geographical & Earth Sciences
Abstract
Timely yet safe decisions require real-time ingestion and assimilation of data pertaining to system dynamics, cognisant of incompleteness, uncertainty, inherent under/over-representation, and bias. This fellowship will devise and implement novel procedures for accommodating the source and operation of missingness and biases in 'found' or "new forms of" data, such as social media and mobile phone data, and propose new ways of triangulating them with traditional statistical sources. It will provide generic and novel methods to use "new forms of data" in ways that are efficient, effective, and safe to use.
Our "digitalised" lives and the popularity of social media, ubiquitous sensors, and gadgets have provided us with an unprecedented opportunity to understand society, the economy, wellbeing, and the physical world at a much higher frequency than traditional surveys and polls. However, this information is normally obtained through uncontrolled recording mechanisms, e.g. observationally, presenting challenges around over/under-representation and biases, missingness, sparsity, and latent dependencies. This inhibits these sources from being integrated effectively with traditional data sources to create a richer, more comprehensive resource on which to build and train the latest statistical and cutting-edge deep learning AI models. Any decisions, patterns, and models that arise from such data, even if the data cover the majority of the population, can overlook the needs of those who do not participate. The foundational statistics, which will be developed from proof-of-concept evidence delivered in the first phase of the fellowship, will be applied to a wide range of applications and disciplines, including policing (e.g. under-reported crime, hidden online harms), social care, public health, inclusive city planning, and, in line with Office for National Statistics (ONS) strategy, the dynamic census.
This fellowship will develop novel models and frameworks based on a paradigm-shifting perspective that considers, or even uses, biases, sparsity, and missing data as useful data. It will provide solutions and mechanisms to reliably use "whole datasets" and to integrate user-generated data with traditional survey data, so that data-driven policies and decisions are meaningful, realistic, and timely.
Novelty:
- Considering "new forms of data" as useful data to be integrated/triangulated with traditional data to provide a reliable, timely, and updated understanding of the systems can open up a wide range of applications that are nationally important and strategic, including managing under-reported crime, better social care and protection of society, inclusive city planning, and dynamic census using administrative and alternative data.
- Considering missingness as useful data, enabling the use of both available and unavailable data to compensate for the missing data.
- Providing an effective procedure to combine new forms of data with traditional datasets with quantifiable measures for quality and fitness for purpose.
- Ethical, legal, and liability considerations of using new forms of data, such as the ethics of data we do not have, can open a wider discussion about ethics, law, security, fairness, reliability, safety, transparency, and accountability. While using such data improves inclusivity and makes the unheard more visible, it also raises ethical questions regarding agency, privacy, and the wider benefit of data.
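As a purely illustrative aside, the idea of triangulating a non-representative "found" data source with a traditional source can be sketched with post-stratification, a standard survey technique: group means from the found data are reweighted to population shares known from a census or survey. This is not the fellowship's method; the function name, the groups, and all figures below are hypothetical.

```python
# Hypothetical illustration: every figure below is invented, not project data.

def poststratified_mean(sample, population_shares):
    """Reweight group means from a non-representative sample to known shares.

    sample: list of (group, value) pairs from the 'found' data source.
    population_shares: dict mapping group -> population share (summing to 1),
        for example taken from a census or a traditional survey.
    """
    totals, counts = {}, {}
    for group, value in sample:
        totals[group] = totals.get(group, 0.0) + value
        counts[group] = counts.get(group, 0) + 1

    absent = set(population_shares) - set(counts)
    if absent:
        # A group with no observations cannot be reweighted; report it
        # instead of silently dropping it from the estimate.
        raise ValueError(f"no observations for groups: {sorted(absent)}")

    return sum(
        share * totals[group] / counts[group]
        for group, share in population_shares.items()
    )


# Toy sample in which one group is heavily over-represented.
sample = [("young", 1.0)] * 80 + [("old", 0.0)] * 20
shares = {"young": 0.4, "old": 0.6}

naive = sum(value for _, value in sample) / len(sample)
adjusted = poststratified_mean(sample, shares)
print(f"naive mean: {naive:.2f}, post-stratified mean: {adjusted:.2f}")
```

The function raises an error when a population group is entirely absent from the found data, because reweighting cannot recover a group that was never observed; that case is exactly where missingness itself has to be treated as information.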
This fellowship will support me in establishing my growing team and my area of research to deliver world-class fundamental and applied research involving "new forms of data". In doing so, I will deliver a suite of methods and mechanisms that enable the effective use of non-standard data sources (potentially in conjunction with traditional data) to maximise benefits and deliver a (near) real-time understanding of cities and societies.
Organisations
Publications

Joseph Shingleton
(2024)
Where is the News? Improving Toponym Identification and Differentiation in Online News

Middleton S
(2024)
AI for Defence: Readiness, Resilience and Mental Health
in The RUSI Journal

Shin H
(2025)
Diagnosing Spatial and Temporal Biases of OSM Contributors: Identifying Differences Between Gender and Age from an Online Survey
in Annals of the American Association of Geographers

Shingleton J
(2024)
Enhancing toponym identification: Leveraging Topo-BERT and open-source data to differentiate between toponyms and extract spatial relationships
in AGILE: GIScience Series

Solomon G
(2024)
Evaluating geotemporal behaviours of OpenStreetMap contributors
in AGILE: GIScience Series


Yuan X
(2025)
Construction enthusiasts versus demolition giants: Insights from building footprint data in England
in Environment and Planning B: Urban Analytics and City Science
Description | The award started only 10 months ago. While we are still waiting for more results, we have seen that the geographic distribution of missing data varies with deprivation, education level, and gender. This indicates that missingness is correlated with the reported values and with some of the factors that cause non-reporting. We hope our statistical model will be able to address this in the next year |
Exploitation Route | Once all the analyses are completed, sharing the results will allow the social science and survey research communities, as well as the Office for National Statistics, to design better frameworks that minimise missing data while protecting the characteristics of participants |
Sectors | Digital/Communication/Information Technologies (including Software) |
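To illustrate why missingness that varies with a covariate such as deprivation matters, the following minimal simulation shows a complete-case mean drifting away from the true mean, and a standard inverse-probability-weighting correction bringing it back. It is a sketch under invented assumptions, not the project's statistical model: the response rates, the effect sizes, and the binary `deprived` flag are all hypothetical.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Simulated areas; every number here is an invented assumption.
rows = []
for _ in range(20000):
    deprived = random.random() < 0.5
    # Assumed: the outcome is higher on average in deprived areas.
    value = random.gauss(10.0 if deprived else 5.0, 1.0)
    # Assumed: people in deprived areas are less likely to respond.
    responded = random.random() < (0.3 if deprived else 0.9)
    rows.append((deprived, value, responded))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


true_mean = mean(v for _, v, _ in rows)
complete_case_mean = mean(v for _, v, r in rows if r)

# Missingness rate per group: the pattern of gaps is itself informative.
missing_rate = {
    group: mean(0.0 if r else 1.0 for d, _, r in rows if d == group)
    for group in (True, False)
}

# Inverse-probability weighting: each respondent stands in for
# 1 / (estimated response rate of their group) people.
response_rate = {group: 1.0 - missing_rate[group] for group in (True, False)}
weighted = [(1.0 / response_rate[d], v) for d, v, r in rows if r]
ipw_mean = sum(w * v for w, v in weighted) / sum(w for w, _ in weighted)

print(f"true mean:          {true_mean:.2f}")
print(f"complete-case mean: {complete_case_mean:.2f}")
print(f"IPW-corrected mean: {ipw_mean:.2f}")
print(f"missing rate (deprived / not): "
      f"{missing_rate[True]:.2f} / {missing_rate[False]:.2f}")
```

The correction works here only because missingness depends on an observed covariate. If the chance of responding depended on the unreported value itself, weighting by observed groups would not be enough, and the missingness mechanism would have to be modelled directly.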
Description | I serve on and advise several national and governmental bodies, including the Office for National Statistics (ONS) Methodological Assurance Review Panel (https://uksa.statisticsauthority.gov.uk/the-authority-board/committees/nationalstatisticians-advisory-committees-and-panels/methodological-assurance-reviewpanel/#pid-prof-ana-basiri), and I co-chair the Scottish Government's ScotStat Board (https://blogs.gov.scot/statistics/2024/02/02/join-the-new-scotstat-board/). As the only member shared between the two panels across the Scottish and UK statistical systems, my contribution resulted in an agreement, for the first time in the history of official statistics in the UK, to produce population and migration estimates for four devolved nations consistently using new forms of data. This is significant from two perspectives: • Basing future population and migration statistics on "new forms of data" instead of (or alongside a partial) decennial census will make the UK one of only three nations that use administrative and new forms of data, rather than a census, as the basis of population estimates. • It creates, for the first time, a pilot for future consistent data sharing and official statistics in the UK. |
First Year Of Impact | 2024 |
Sector | Digital/Communication/Information Technologies (including Software) |
Impact Types | Societal Economic Policy & public services |
Description | Co-Chair of ScotStat Board |
Geographic Reach | National |
Policy Influence Type | Implementation circular/rapid advice/letter to e.g. Ministry of Health |
Impact | Our goal, as statisticians, is to improve data-driven decision-making in the public sector. To do this we need to be at the top of our game in communicating our statistics. Our published statistics must be relevant to inform public debate. And we have the power, through our data and analysis, to shine a light on environmental and societal issues that would otherwise remain hidden. Statisticians working on individual topic areas regularly engage with their users to guide their work. However, it is also important that we reflect on the statistical system as a whole. And so, to support analytical colleagues across the public sector, the Scottish Government established a new and refreshed "ScotStat Board for Official Statistics". |
URL | https://blogs.gov.scot/statistics/2024/02/02/join-the-new-scotstat-board/ |
Description | Office for National Statistics' Methodological Assurance Review panel |
Geographic Reach | National |
Policy Influence Type | Contribution to a national consultation/review |
Impact | We provided independent assurance and guidance on the statistical methodology underpinning the 2021 census estimates and those based on administrative sources, identified significant gaps and risks in methods, made suggestions for mitigation, and reviewed administrative data methods and contributed to their continuous improvement. This has resulted in a public consultation on changing the future of statistics and population estimation in the country to use "new forms of data". |
URL | https://uksa.statisticsauthority.gov.uk/the-authority-board/committees/national-statisticians-adviso... |
Description | Sustainable AI Report |
Geographic Reach | Multiple continents/international |
Policy Influence Type | Contribution to a national consultation/review |
Impact | I contributed to the report "Engineering Responsible AI: foundations for environmentally sustainable AI", developed by the Royal Academy of Engineering in partnership with the Institution of Engineering and Technology and BCS, the Chartered Institute for IT, under the National Engineering Policy Centre (NEPC). The report was presented at the AI summit in Paris and set out the best practices and policies required to ensure that artificial intelligence remains sustainable. It also received attention and mentions in several articles by the BBC and The Guardian, and ultimately fed into the creation of the United Nations sustainable AI working group |
URL | https://raeng.org.uk/media/2aggau2j/foundations-for-sustainable-ai-nepc-report.pdf |
Description | Giving evidence to the Scottish Parliament on AI regulation - Review of the UK-EU Trade and Cooperation Agreement |
Geographic Reach | Multiple continents/international |
Policy Influence Type | Contribution to a national consultation/review |
Impact | As part of its ongoing work on UK-EU relations, including the Trade and Cooperation Agreement, the CEEAC Committee agreed to hold an evidence session on AI on the morning of 13th March, at which I served as an expert for its Review of the UK-EU Trade and Cooperation Agreement inquiry. I also contributed to an advance submission led by the RSE. The purpose of the session was to explore relevant issues such as the regulatory environment for AI within the context of future EU-UK relations. |
URL | https://www.parliament.scot/chamber-and-committees/committees/current-and-previous-committees/sessio... |
Title | How close is close? |
Description | This is a data set and methodology for understanding how large language models interpret the concept of proximity |
Type Of Material | Improvements to research infrastructure |
Year Produced | 2025 |
Provided To Others? | Yes |
Impact | The data and the methodology are publicly accessible, to help understand spatial relationships, such as proximity, that are challenging for large language models. https://osf.io/r3ep7/?view_only=b4783a1d8dcc47109650a7eff0bc662e |
URL | https://osf.io/r3ep7/?view_only=b4783a1d8dcc47109650a7eff0bc662e |
Title | location estimation using image |
Description | This is a novel method, together with large data sets, providing a location estimation technique based on visual transformers that localises an image taken by an ordinary person. This is highly important for localisation and positioning where GPS is not available, such as urban environments where high-rise buildings may block the signal. However, it required a relatively large amount of training data and a novel solution. |
Type Of Material | Improvements to research infrastructure |
Year Produced | 2024 |
Provided To Others? | Yes |
Impact | https://huggingface.co/spaces/yunusserhat/Location_Predictor |
URL | https://huggingface.co/spaces/yunusserhat/Location_Predictor |