Dependence Modelling with Vine Copulas for the Integration of Unstructured and Structured Data
Lead Research Organisation:
Plymouth University
Department Name: Sch of Eng, Comp and Math (SECaM)
Abstract
The project will develop a statistical data integration methodology, never considered before, that utilizes multiple sources of information to provide more accurate predictions than those currently available. Today we are living in the Big Data era, where masses of data in traditional formats are produced by companies and organizations and large quantities of information, mostly unstructured, are generated by social media every second. However, are we effectively and efficiently exploiting all the information available to us from official and social media sources? The answer to this question is definitely no. Most of the statistical approaches used to solve real-world problems are based on a single source of information and, although preliminary work attempting to leverage social media data exists, there are currently no comprehensive and functional methodologies able to fully capitalize on unstructured information and its associations with other available structured data. The consequence is that precious information contained in unstructured online data continues to be neglected and lost. While technology and digitalization advances are shaping the world, statistics is struggling to keep pace and it is currently in critical and urgent need of revolutionizing its methods and practices. This proposal aims at filling this gap, giving life to a pioneering and transformative statistical data integration methodology, fully leveraging the power of different sources of information, such as traditional and online-generated data. The project will support early-stage research on integrating unstructured and structured data using a new methodology based on vine copulas that will form the basis of future analyses, which will lead to a radical transformation of current data approaches, propelling statistics towards the future era.
For this research, which is early-stage, yet will bring immediately usable results, the methodology will be applied to data of crimes committed in the South West region of the UK, integrating official police information, provided by our project partner Devon and Cornwall Police (DCP), with crime data discussed on different social media platforms. Our approach will provide a more thorough and realistic appraisal of the volume and severity of crimes in specific locations of the South West, since it will also account for hidden crimes, unreported to the police, but emerging from social media. The results of this project will be used by DCP to more effectively plan and organize their interventions and to efficiently allocate resources in targeted areas. Providing a deeper and more accurate knowledge of the geographical locations of criminal offences, including unreported crimes, this project will assist the police to better support communities in high criminal risk areas with timely interventions, making people feel more protected and safer. This will promote social inclusion and more equitable communities, especially in disadvantaged areas that are mostly affected by high criminality levels, including crimes which are not reported via traditional channels.
This project, initially targeting the South West of the UK, will lay the foundation for future grant applications extending the geographical area under assessment at national level.
In addition, due to the endless number of possible applications of our methodology, this project will be the milestone that will generate further breakthroughs in any other area of science where multiple data sources are available and accurate predictions are needed.
This project is timely since it addresses the urgent need to fully leverage the social media information currently available, but not taken advantage of. This research will provide a key opportunity for the UK to secure a leading international position at the forefront of advances in knowledge extraction, leading to huge social and economic benefits.
People |
ORCID iD |
Luciana Dalla Valle (Principal Investigator) |
Publications
Ansell L (2023) A new data integration framework for Covid-19 social media information, in Scientific Reports
Ansell L (2022) Social Media Integration of Flood Data: A Vine Copula-Based Approach, in Journal of Environmental Informatics
Barone R (2023) Bayesian Nonparametric Modeling of Conditional Multidimensional Dependence Structures, in Journal of Computational and Graphical Statistics
Dalla Valle L (2022) Data Integration and Graphical Models for Cryptocurrencies
Dalla Valle L (2023) Bayesian nonparametric inference for conditional vine copulas
Grazian C (2022) Approximate Bayesian conditional copulas, in Computational Statistics & Data Analysis
Description | The work undertaken led to the application of data integration research methods to structured historical data and unstructured social media data. The next steps will include the development of novel statistical methodology using improved statistical tools. |
Exploitation Route | Academics working in applied fields will be able to use our results to integrate structured and unstructured data in various contexts. Industrialists will be able to apply the developed methods and algorithms in their areas of application. |
Sectors | Communities and Social Services/Policy; Digital/Communication/Information Technologies (including Software); Energy; Environment; Financial Services and Management Consultancy; Healthcare |
Description | The methods and algorithms developed as part of the work undertaken have contributed to society by showing members of the public how social media information can be used to improve knowledge of, e.g., environmental and health phenomena. |
Impact Types | Societal |
Description | Research Reboot Grant |
Amount | £1,000 (GBP) |
Funding ID | 42112 |
Organisation | London Mathematical Society |
Sector | Academic/University |
Country | United Kingdom |
Start | 05/2022 |
End | 06/2022 |
Description | Short-Term Scientific Mission |
Amount | € 1,440 (EUR) |
Organisation | European Cooperation in Science and Technology (COST) |
Sector | Public |
Country | Belgium |
Start | 05/2022 |
End | 06/2022 |
Description | Short-Term Scientific Mission |
Amount | € 2,000 (EUR) |
Organisation | European Cooperation in Science and Technology (COST) |
Sector | Public |
Country | Belgium |
Start | 05/2023 |
End | 07/2023 |
Title | BICC R package |
Description | The BICC R package implements all the methodologies presented in the paper by Grazian, Dalla Valle & Liseo (2022) Approximate Bayesian Conditional Copulas. |
Type Of Material | Computer model/algorithm |
Year Produced | 2022 |
Provided To Others? | Yes |
Impact | The code included in the BICC R package allows users to model the dependence of complex phenomena and has been applied to civil engineering and astrophysics datasets. |
URL | https://github.com/cgrazian/BICC |
Title | Bayesian nonparametrics conditional vines |
Description | The code implements the methodology described in the paper by Barone & Dalla Valle (2023) Bayesian Nonparametric Modelling of Conditional Multidimensional Dependence Structures. |
Type Of Material | Computer model/algorithm |
Year Produced | 2023 |
Provided To Others? | Yes |
Impact | The methodology implemented in the code was applied to a veterinary dataset and to a sustainable economics dataset. |
URL | https://www.tandfonline.com/doi/suppl/10.1080/10618600.2023.2173604?scroll=top&role=tab |
Title | Covid19 vine copula integration |
Description | The code implements the methodology described in the paper by Ansell & Dalla Valle (2021) A New Data Integration Framework for Covid-19 Social Media Information. |
Type Of Material | Computer model/algorithm |
Year Produced | 2023 |
Provided To Others? | Yes |
Impact | The code was applied to combine structured datasets retrieved from official sources and a big unstructured dataset of information collected from social media. |
URL | https://github.com/laurenansell/A-New-Data-Integration-Framework-for-Covid-19-Social-Media-Informati... |
Description | Crime risk |
Organisation | Devon and Cornwall Police |
Country | United Kingdom |
Sector | Public |
PI Contribution | My expertise and intellectual input to develop data integration statistical models |
Collaborator Contribution | The partner provided data to assist in developing an approach to understanding unreported crime and harm using social media and sentiment analysis. |
Impact | The methodology adopted for the collaboration with the partner arises from the following outputs: - Ansell L. & Dalla Valle L. (2022) A New Data Integration Framework for COVID-19 Social Media Information. - Sheikhi A., Dalla Valle L. & Mesiar R. (2023) On the use of time-varying vine copulas in multivariate time series analysis. - Dalla Valle L. & Tarantola C. (2022) Data Integration and Graphical Models for Cryptocurrencies. - Ansell L. & Dalla Valle L. (2022) Social Media Integration of Flood Data: A Vine Copula-Based Approach. |
Start Year | 2022 |