VADA: Value Added Data Systems -- Principles and Architecture

Lead Research Organisation: University of Oxford
Department Name: Computer Science

Abstract

Data is everywhere, generated by increasing numbers of applications, devices and users, with few or no guarantees on the format, semantics, and quality. The economic potential of data-driven innovation is enormous, estimated by the Centre for Economics and Business Research to reach as much as £40B in 2017. To realise this potential, and to provide meaningful data analyses, data scientists must first spend a significant portion of their time (estimated as 50% to 80%) on "data wrangling" - the process of collecting, reorganising, and cleaning data.

This heavy toll is due to what is referred to as the four V's of big data: Volume - the scale of the data, Velocity - speed of change, Variety - different forms of data, and Veracity - uncertainty of data. There is an urgent need to provide data scientists with a new generation of tools that will unlock the potential of data assets and significantly reduce the data wrangling component. As many traditional tools are no longer applicable in the four V's environment, a radical paradigm shift is required. The proposal aims to achieve this paradigm shift by adding value to data, by handling data management tasks in an environment that is fully aware of data and user contexts, and by closely integrating key data management tasks in a way not yet attempted, but desperately needed by many innovative companies in today's data-driven economy.

The VADA research programme will define principles and solutions for Value Added Data Systems, which support users in discovering, extracting, integrating, accessing and interpreting the data of relevance to their questions. In so doing, it uses the context of the user, e.g., requirements in terms of the trade-off between completeness and correctness, and the data context, e.g., its availability, cost, provenance and quality. The user context characterises not only what data is relevant, but also the properties it must exhibit to be fit for purpose. Adding value to data then involves the best effort provision of data to users, along with comprehensive information on the quality and origin of the data provided. Users can provide feedback on the results obtained, enabling changes to all data management tasks, and thus a continuous improvement in the user experience.

Establishing the principles behind Value Added Data Systems requires a revolutionary approach to data management, informed by interlinked research in data extraction, data integration, data quality, provenance, query answering, and reasoning. This will enable each of these areas to benefit from synergies with the others. Research has developed focused results within such sub-disciplines; VADA develops these specialisms in ways that both transform the techniques within the sub-disciplines and enable the development of architectures that bring them together to add value to data.

The commercial importance of the research area has been widely recognised. The VADA programme brings together university researchers with commercial partners who are in desperate need of a new generation of data management tools. They will be contributing to the programme by funding research staff and students, providing substantial amounts of staff time for research collaborations, supporting internships, hosting visitors, contributing challenging real-life case studies, sharing experiences, and participating in technical meetings. These partners are both developers of data management technologies (LogicBlox, Microsoft, Neo) and data user organisations in healthcare (The Christie), e-commerce (LambdaTek, PricePanda), finance (AllianceBernstein), social networks (Facebook), security (Horus), smart cities (FutureEverything), and telecommunications (Huawei).

Planned Impact

The economic impact of relevant activities is difficult to approximate, but the value of the sub-areas of Big Data, Data Integration and Data Quality is forecast to be over $50B by 2017:
- The International Institute of Analytics estimates the Big Data market at $16.1B in 2014, growing 6 times faster than the overall IT market. The projection for 2017 is ~$50B.
- Gartner (2014) estimates the Data Integration tool market at over $2.2B at end 2013, an increase of 9.4% from 2012. This growth rate is above average for the enterprise software market. By 2018 total revenue should be ~$3.6B.
- Gartner (2014) estimates the Data Quality market as $960M in software revenue at end 2012 ($2B by 2017), an increase of 12.3% from 2011.
Thus directly associated markets - with users across government, industry, health and commerce - are large and fast growing.

Who will benefit from this research?

Data is central to the efficient operation of many technology development and user organisations, and is the raison d'etre for many others. Here we categorise potential VADA beneficiaries into:
1. Technology providers of platforms and solutions for collecting, integrating, and aggregating data. Partner examples include LogicBlox, Microsoft, Neo. New business opportunities are likely to emerge, where impact results from the development of techniques to enable more efficient and effective use of available data.
2. Organisations having a need for such platforms. This is almost every organisation; our partners include knowledge companies who work with product (LambdaTek, PricePanda), financial (AllianceBernstein), security (Horus), social networking (Facebook), telecommunications (Huawei), governmental (FutureEverything) and healthcare (The Christie) data.

All partners have highlighted the importance of this research in their support letters:
* VADA addresses fundamental questions that have great significance (Microsoft),
* The challenge addressed by VADA is a significant one (LogicBlox),
* VADA tackles several problems that are of great interest (Facebook),
* We need an automatic approach to reliable, timely and continuous collection and evaluation of sources against an ever-increasing amount of raw data. Current data collection technologies are neither reliable nor scalable enough. (Horus)
* To remain competitive we need to enrich our product data with extended background data. No technology that currently exists can do this. (LambdaTek).

How might they benefit from this research?

VADA's impact is in line with the RCUK priorities:
1. Contribute toward wealth creation and economic prosperity. VADA will develop techniques and methodologies informing the development of platforms to add value to data. Among the many mechanisms that can realise this, we propose a consultancy spin-out. We believe that this will ease the efficient transfer of knowledge from academia to UK industry, as previously demonstrated by similar successful ventures.
2. Shape/enhance effectiveness of public services. The UK has signed up to the Open Government Declaration, which should make travel easier and healthcare better, and create significant growth for UK industry (http://www.cabinetoffice.gov.uk/news/open-data-measures-autumn-statement). However, exploiting such data involves inter-relating it with other data sources, managing variety and veracity. SMEs such as FutureEverything will benefit from efficient techniques for adding value to such data.
3. Enhance training capacity, knowledge and skills of businesses and organisations. Within many organisations, efficient sharing and use of data is crucial for decision-making. VADA will directly train 11 PhD students, supporting exchange visits, workshops, and a summer school. VADA's academics will also be involved in the design of training courses on Value Added Data Systems for the next generation of higher education post-graduate programmes and skill training courses for industry.

Publications

 
Description We have provided a complete spectrum of the steps required for data science activities in the context of Vadalog, including 1) Data Integration and Pre-processing, 2) Statistical Analysis, 3) Machine Learning, 4) Algorithmic Modelling, and 5) Probabilistic Reasoning. Each of these parts has seen significant improvements, including the incorporation of further machine learning approaches. We have extended the research activities around Vadalog to involve further machine learning techniques, with extensive studies on Neural Networks and Knowledge Graph Embedding Models. Recent work in these areas has shown the importance of logical rules. As the core of Vadalog is rule-based reasoning, we launched related work on injecting logical rules into such ML approaches. As Knowledge Graphs are characterised by uncertainty in the form of noisy, missing, and incorrect data, we investigated the effect of noise in the presence of logical rules. We show that by introducing a new loss function that is both pattern-aware and noise-resilient, significant performance issues can be solved.
Exploitation Route 1. The newly developed and researched machine learning-based models can significantly improve the results of rule mining and reasoning processes, besides improving the execution efficiency of the Vadalog system.
2. Embedding models are specially designed for link prediction tasks and this characteristic can be used in making more complex steps of logic-based reasoning more efficient.
Sectors Aerospace, Defence and Marine; Digital/Communication/Information Technologies (including Software); Education; Retail; Transport

URL http://vada.org.uk/
 
Description 1. The findings have been used in DBLP to improve their web data extraction approach. In particular, the enhanced version of OXPath is able to extract data from complex web applications, which was not possible earlier.
2. Other non-academic/company collaborations have been conducted. Results are subject to further progress on these collaborations.
3. After becoming familiar with VADA-supported work on mappings of property graphs and on formal semantics of query languages, Neo Technology funded a follow-up project on providing formal underpinnings and semantics of the Cypher query language of their Neo4j graph database system.
4. A series of meetings has been held with BAE Systems. Significant human effort is currently needed to manually process and integrate data, which can delay effective decision making. As a result, there is interest in the combination of decision support and data integration that is being explored in VADA, and we hope this will lead to a research collaboration in due course.
5. The impact of machine learning approaches and embedding models has been examined with a use case in the financial domain, and is planned to be extended to industry scale.
6. VADA has had significant impact on research partners, in particular recently on the leading appliance company Miele and the leading bank Sberbank.
7. The Central Bank of Italy has a continuous interest in VADA and the Vadalog system. There is ample evidence, in the form of publications, of the ongoing and fruitful scientific collaboration.
First Year Of Impact 2019
Sector Aerospace, Defence and Marine; Digital/Communication/Information Technologies (including Software); Education; Financial Services, and Management Consultancy
Impact Types Societal; Economic; Policy & public services

 
Description Amazon Web Services Research Credits
Amount $10,760 (USD)
Organisation Amazon.com 
Sector Private
Country United States
Start 08/2018 
End 09/2019
 
Description EPSRC Industrial CASE Doctoral Studentship: AI and Cognitive Computing for Reasoning about Big Data with Application to the Oil and Gas Industry
Amount £200,000 (GBP)
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 09/2017 
End 04/2021
 
Description Efficient Querying of Inconsistent Data
Amount £606,439 (GBP)
Funding ID EP/S003800/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 08/2018 
End 08/2024
 
Description Innovate UK Internet of Things Cities Demonstrator
Amount £856,996 (GBP)
Organisation Innovate UK 
Sector Public
Country United Kingdom
Start 06/2016 
End 06/2018
 
Description LAMBDA: Learning, Applying, Multiplying Big Data Analytics
Amount £168,835 (GBP)
Funding ID GA No. 809965 
Organisation European Commission 
Sector Public
Country European Union (EU)
Start 06/2018 
End 12/2020
 
Description Neo Technology - industry funding
Amount £180,000 (GBP)
Organisation Neo4j 
Sector Private
Country United States
Start 01/2017 
End 09/2021
 
Description New generation of graph query languages
Amount £59,999 (GBP)
Organisation The Leverhulme Trust 
Sector Charity/Non Profit
Country United Kingdom
Start 08/2022 
End 03/2024
 
Description RS Wolfson
Amount £50,000 (GBP)
Organisation The Royal Society 
Sector Charity/Non Profit
Country United Kingdom
Start 01/2017 
End 12/2021
 
Description Raison Data - Royal Society Research Professorship
Amount £1,304,142 (GBP)
Funding ID RP\R1\201074 
Organisation The Royal Society 
Sector Charity/Non Profit
Country United Kingdom
Start 03/2020 
End 02/2025
 
Description Ratiolytics: a rule-based AI system for reasoning, data wrangling and analytics
Amount £44,123 (GBP)
Funding ID EP/R511742/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 06/2017 
End 04/2018
 
Description Bank of Italy 
Organisation Bank of Italy
Country Italy 
Sector Public 
PI Contribution Bank of Italy (Banca d'Italia - Italy's National Bank) uses VADALOG, the reasoning language generated by the VADA project. Oxford hosted a Bank of Italy Senior Engineer (Luigi Bellomarini) and introduced him to the VADA technology and to the underlying research.
Collaborator Contribution The Bank of Italy adopted the VADALOG language and system and obtained a license from the VADA startup DeepReason.ai. Based on this software, BoI introduced us to many new use cases and contributed to various research papers. The cooperation has been ongoing from the academic year 2016/17 until beyond the end of the project.
Impact The outputs are all the publications in the list of publications co-authored by Dr Luigi Bellomarini; please search for "Bellomarini" in the publication list. The collaboration is in the field of Computer Science with application to central bank problems and economics. For example, new methods for detecting the degree of ownership between two companies were developed.
Start Year 2016
 
Description LDBC 
Organisation Linked Data Benchmark Council (LDBC)
Country United Kingdom 
Sector Charity/Non Profit 
PI Contribution LDBC is a key international organisation supported by multiple industrial partners (Oracle, Neo4j, AWS, TigerGraph etc) in charge of bringing together industry and academia in developing new standards and benchmarks for graph databases. I chair one of its working groups, on formal semantics of query languages, and actively participate in two others (on treatment of null values and on property graph schemas).
Collaborator Contribution They provide us with tools to facilitate collaboration.
Impact The most visible one so far is the forthcoming SIGMOD 2021 paper on keys for property graphs. Others go via ISO.
Start Year 2020
 
Description Neo technology (Neo4j) 
Organisation Neo4j
Country United States 
Sector Private 
PI Contribution We established a joint research project with a leading vendor of graph databases, Neo Technology (based in the UK and Sweden; the name of their product is Neo4j). Our initial goal was to produce a formal semantics of their query language Cypher and to make suggestions about further development of the language. It was then expanded to the design of the new graph query language GQL.
Collaborator Contribution In addition to committing a significant amount of its core staff's time, Neo4j has provided funding continuously since 2017.
Impact A full formal semantics of the core language has been developed. Papers describing it appeared in SIGMOD 2018 and VLDB 2019. Since then the focus has shifted to GQL (papers are to be written).
Start Year 2017
 
Description Peak.ai 
Organisation Peak AI
Country United Kingdom 
Sector Private 
PI Contribution We are working with Peak.ai through a Knowledge Transfer Partnership, to develop and apply techniques for entity resolution and data discovery.
Collaborator Contribution Building on our work in VADA on data discovery, we have been working with Peak on techniques to make the onboarding of customer data more systematic and less labour-intensive.
Impact N/A
Start Year 2019
 
Title JupyterLab environment for Vadalog execution 
Description We have adapted the JupyterLab data science environment to execute Vadalog, the reasoning language developed in the context of the VADA project. Features include: (1) rule authoring and execution, (2) interaction with Python and R, (3) program analysis and debugging, (4) model explanations (proof trees and audit trails), and (5) visualisation.
Type Of Technology Webtool/Application 
Year Produced 2018 
Impact This is a main contribution towards adoption of Vadalog in the wider community, as Jupyter has recently become a popular tool in the data science and research community. We can expect researchers to use Vadalog for specific purposes alongside other tools in their toolchain, without having to adapt to a new environment. 
 
Company Name DeepReason.ai 
Description DeepReason.ai develops a 'Knowledge Graph Platform' for organisations, which uses AI to unify data from multiple areas of the business in order to conduct analysis. 
Year Established 2018 
Impact The company was acquired by Meltwater Inc. in November 2021: https://www.meltwater.com/en/about/press-releases/meltwater-acquires-deepreason-ai
Website https://deepreason.ai/
 
Company Name The Data Value Factory 
Description The Data Value Factory develops software for cleaning and integrating data. 
Year Established 2018 
Impact The company provides software and services for data preparation.
Website http://thedatavaluefactory.com
 
Description Chair, Formal Semantics Working Group of the Linked Data Benchmark Council 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact The Linked Data Benchmark Council is an organisation that arranges work by academics on behalf of graph database vendors as well as groups that produce new standards for graph query languages. Since 2020, Leonid Libkin has led the formal semantics working group, which comprises academics from the UK, France, Germany, Poland, and Chile and analyses the emerging standard for graph querying called GQL. The group works in close collaboration with companies such as Neo4j (UK/Sweden), Oracle and TigerGraph (US). Its contributions are already reflected in the new part of the SQL standard for querying graphs, SQL/PGQ.
Year(s) Of Engagement Activity 2020,2021,2022
 
Description Data Wrangling for Big Data 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact This workshop included presentations, demonstrations and posters on work that relates to big data wrangling, with a view to sharing best practice and emerging techniques. The presentations included ongoing work from the VADA partners as well as presentations from industrial experts. The event was intended primarily for data scientists and computer scientists from business/industry and academia. It sparked further discussions about the field of data wrangling, extraction, cleaning and reasoning, i.e., the subjects of the VADA project.
Year(s) Of Engagement Activity 2017
URL https://www.turing.ac.uk/events/data-wrangling-big-data/
 
Description Invited Speaker at International Conference on Model and Data Engineering 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The Eighth International Conference on Model & Data Engineering (MEDI) will be held from 24 to 26 October 2018 in Marrakesh, Morocco. Its main objective is to provide a forum for the dissemination of research accomplishments and to promote interaction and collaboration between the models and data research communities. MEDI'2018 provides an international platform for the presentation of research on models and data theory, the development of advanced technologies related to models and data, and their advanced applications. This international scientific event, initiated by researchers from Euro-Mediterranean countries, also aims at promoting the creation of north-south scientific networks, projects and faculty/student exchanges.
Year(s) Of Engagement Activity 2018
URL https://easychair.org/cfp/MEDI2018
 
Description Invited Speaker at International European Conference on Logics in Artificial Intelligence 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The conference is about logic in AI, and I gave a talk about Vadalog, a logic-based reasoning language for modern AI applications, in particular for knowledge graph systems. I presented recent advances and applications, with a focus on the language Vadalog itself.
Year(s) Of Engagement Activity 2019
URL https://jelia2019.mat.unical.it/invited-speakers#h.p_446ZRly1aeB1
 
Description Invited Talk - Lovelace Lecture And Conferment Of The Lovelace Medal 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Industry/Business
Results and Impact 22 March 2018 Lovelace Lecture And Conferment Of The Lovelace Medal to Prof. Gottlob
Year(s) Of Engagement Activity 2018
URL https://www.bcs.org/category/19248
 
Description Invited plenary speaker at International Conference On Scalable Uncertainty Management 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Keynote talk: Swift Logic for Big Data and Knowledge Graphs
Year(s) Of Engagement Activity 2018
URL http://www.ir.disco.unimib.it/sum2018/invited-speakers/
 
Description Keynote at Austrian Computer Science Day 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact I talked about my adventures with Datalog, walking the thin line between theory and practice.
Year(s) Of Engagement Activity 2017,2019
URL https://acsd2019.ai.wu.ac.at/timetable/event/georg-gottlob/
 
Description Lecture at the Samsung Cambridge Research Centre 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Industry/Business
Results and Impact Lecture entitled: A Journey from Web Data Extraction over Knowledge Graphs towards Integrating Rules with Machine Learning

Abstract: This talk first reports on DIADEM, a past ERC-funded project at Oxford University for fully-automated domain-specific web data extraction (http://diadem.cs.ox.ac.uk/). DIADEM loosely integrates machine learning (ML) tasks with transferable rule-based knowledge. This project was very successful and gave rise to a spin-out company. The ML-knowledge integration in DIADEM was ad hoc and problem-specific, and therefore, in a follow-up project, we designed the VADALOG knowledge graph management system, which allows engineers to realize applications in various areas that make use of a loose integration of rule-based knowledge and ML. Another goal of VADALOG is efficient and expressive reasoning over Big Data. VADALOG was designed in the context of the EPSRC Program Grant VADA (Value-Added Data Systems; https://vada.org.uk/). We describe the language and principles underlying VADALOG, and discuss some applications of VADALOG developed by the DeepReason.ai VADA spin-out (founded in 2018). The journey does not finish here. In the context of the new RAISON DATA project, we aim at a much tighter integration of ML with rule-based knowledge. We will give some motivations and present our initial approach en route to a new type of system.
Year(s) Of Engagement Activity 2020
 
Description Member of the SQL Standard ISO Committee (officially: ISO/IEC JTC1 SC32 WG3) 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact SQL is the main language of relational database systems, used by practically all business and governmental organizations. It is standardized by ISO (International Organization for Standardization). Since 2018, Leonid Libkin has been a member of that committee, currently one of only 4 academics influencing the design of this ubiquitous query language in such areas as handling graph queries and incomplete data.
Year(s) Of Engagement Activity 2018,2019,2020,2021,2022
 
Description Milner Lecture, at Edinburgh University 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Postgraduate students
Results and Impact Milner Lecture 2018: Swift Logic for Big Data and Knowledge Graphs
Year(s) Of Engagement Activity 2018
URL http://wcms.inf.ed.ac.uk/lfcs/events/swift-logic-for-big-data-and-knowledge-graphs
 
Description The EDBT Summer School 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact The EDBT Summer School brings together leading researchers recognised as experts in their fields and provides participants the opportunity to gain deeper insight into current research trends in the database area. In 2017, the theme of the school will be "Adding Value to Data". The scientific topics will cover principles and solutions for adding value to data, that is, for supporting users in discovering, extracting, integrating, accessing and interpreting the data of relevance to their questions, while taking into account the role of the crowd and the impact of dirty data as well as the adoption of responsible data management and analysis processes. The school will be organised around 7 main themes.
The 2017 summer school follows the successful structure of previous EDBT schools: stimulating lectures by leading researchers in the field (two for each main theme), group work on assignments, and a lively scientific and social program.
Year(s) Of Engagement Activity 2017
 
Description Invited plenary speaker at RuleML+RR
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Invited plenary speaker at RuleML+RR
Year(s) Of Engagement Activity 2018