Dependence Modelling with Vine Copulas for the Integration of Unstructured and Structured Data

Lead Research Organisation: Plymouth University
Department Name: Sch of Eng, Comp and Math (SECaM)

Abstract

The project will develop a statistical data integration methodology, never considered before, that utilizes multiple sources of information to provide more accurate predictions than those currently available. Today we are living in the Big Data era, where masses of data in traditional formats are produced by companies and organizations, and large quantities of mostly unstructured information are generated by social media every second. However, are we effectively and efficiently exploiting all the information available to us from official and social media sources? The answer to this question is definitely no. Most of the statistical approaches used to solve real-world problems are based on a single source of information and, although preliminary work attempting to leverage social media data exists, there are currently no comprehensive and functional methodologies able to fully capitalize on unstructured information and its associations with other available structured data. The consequence is that precious information contained in unstructured online data continues to be neglected and lost. While advances in technology and digitalization are shaping the world, statistics is struggling to keep pace and is in critical and urgent need of revolutionizing its methods and practices. This proposal aims to fill this gap by creating a pioneering and transformative statistical data integration methodology that fully leverages the power of different sources of information, such as traditional and online-generated data. The project will support early-stage research on integrating unstructured and structured data using a new methodology based on vine copulas. This methodology will form the basis of future analyses and lead to a radical transformation of current data approaches, propelling statistics towards the future era.
For this research, which is early-stage yet will bring immediately usable results, the methodology will be applied to data on crimes committed in the South West region of the UK, integrating official police information, provided by our project partner Devon and Cornwall Police (DCP), with crime data discussed on different social media platforms. Our approach will provide a more thorough and realistic appraisal of the volume and severity of crimes in specific locations of the South West, since it will also account for hidden crimes that are unreported to the police but emerge from social media. The results of this project will be used by DCP to plan and organize their interventions more effectively and to allocate resources efficiently in targeted areas. By providing deeper and more accurate knowledge of the geographical locations of criminal offences, including unreported crimes, this project will help the police to better support communities in areas at high risk of crime with timely interventions, making people feel safer and more protected. This will promote social inclusion and more equitable communities, especially in disadvantaged areas that are most affected by high levels of criminality, including crimes which are not reported via traditional channels.
This project, initially targeting the South West of the UK, will lay the foundation for future grant applications that extend the geographical area under assessment to the national level.
In addition, because our methodology has a very wide range of possible applications, this project will be a milestone towards further breakthroughs in any other area of science where multiple data sources are available and accurate predictions are needed.
This project is timely since it addresses the urgent need to fully leverage social media information that is currently available but not exploited. This research will provide a key opportunity for the UK to secure a leading international position at the forefront of advances in knowledge extraction, leading to huge social and economic benefits.
 
Description The work undertaken led to the application of data integration research methods to structured historical data and unstructured social media data. The next steps will include the development of novel statistical methodology using improved statistical tools.
Exploitation Route Academics working in applied fields will be able to use our results to integrate structured and unstructured data in various contexts.
Industrialists will be able to apply the developed methods and algorithms in their areas of application.
Sectors Communities and Social Services/Policy

Digital/Communication/Information Technologies (including Software)

Energy

Environment

Financial Services, and Management Consultancy

Healthcare

 
Description The methods and algorithms developed as part of the work undertaken have contributed to society by illustrating to members of the public how social media information can be used to improve knowledge of, e.g., environmental and health phenomena.
Impact Types Societal

 
Description Research Reboot Grant
Amount £1,000 (GBP)
Funding ID 42112 
Organisation London Mathematical Society 
Sector Academic/University
Country United Kingdom
Start 05/2022 
End 06/2022
 
Description Short-Term Scientific Mission
Amount € 1,440 (EUR)
Organisation European Cooperation in Science and Technology (COST) 
Sector Public
Country Belgium
Start 05/2022 
End 06/2022
 
Description Short-Term Scientific Mission
Amount € 2,000 (EUR)
Organisation European Cooperation in Science and Technology (COST) 
Sector Public
Country Belgium
Start 05/2023 
End 07/2023
 
Title BICC R package 
Description The BICC R package implements all the methodologies presented in the paper by Grazian, Dalla Valle & Liseo (2022) Approximate Bayesian Conditional Copulas. 
Type Of Material Computer model/algorithm 
Year Produced 2022 
Provided To Others? Yes  
Impact The code included in the BICC R package makes it possible to model the dependence structure of complex phenomena and has been applied to civil engineering and astrophysics datasets. 
URL https://github.com/cgrazian/BICC
 
Title Bayesian nonparametrics conditional vines 
Description The code implements the methodology described in the paper by Barone & Dalla Valle (2023) Bayesian Nonparametric Modelling of Conditional Multidimensional Dependence Structures. 
Type Of Material Computer model/algorithm 
Year Produced 2023 
Provided To Others? Yes  
Impact The methodology implemented in the code was applied to a veterinary dataset and to a sustainable economics dataset. 
URL https://www.tandfonline.com/doi/suppl/10.1080/10618600.2023.2173604?scroll=top&role=tab
 
Title Covid19 vine copula integration 
Description The code implements the methodology described in the paper by Ansell & Dalla Valle (2021) A New Data Integration Framework for Covid-19 Social Media Information. 
Type Of Material Computer model/algorithm 
Year Produced 2023 
Provided To Others? Yes  
Impact The code was applied to combine structured datasets retrieved from official sources with a large unstructured dataset of information collected from social media. 
URL https://github.com/laurenansell/A-New-Data-Integration-Framework-for-Covid-19-Social-Media-Informati...
 
Description Crime risk 
Organisation Devon and Cornwall Police
Country United Kingdom 
Sector Public 
PI Contribution My expertise and intellectual input to develop data integration statistical models
Collaborator Contribution The partner provided data to assist in developing an approach to understanding unreported crime and harm using social media and sentiment analysis.
Impact The methodology adopted for the collaboration with the partner arises from the following outputs:
- Ansell L. & Dalla Valle L. (2022) A New Data Integration Framework for COVID-19 Social Media Information.
- Sheikhi A., Dalla Valle L. & Mesiar R. (2023) On the use of time-varying vine copulas in multivariate time series analysis.
- Dalla Valle L. & Tarantola C. (2022) Data Integration and Graphical Models for Cryptocurrencies.
- Ansell L. & Dalla Valle L. (2022) Social Media Integration of Flood Data: A Vine Copula-Based Approach.
Start Year 2022