VADA: Value Added Data Systems -- Principles and Architecture

Lead Research Organisation: University of Oxford
Department Name: Computer Science

Abstract

Data is everywhere, generated by increasing numbers of applications, devices and users, with few or no guarantees on the format, semantics, and quality. The economic potential of data-driven innovation is enormous: the Centre for Economics and Business Research estimates that it could reach as much as £40B in 2017. To realise this potential, and to provide meaningful data analyses, data scientists must first spend a significant portion of their time (estimated as 50% to 80%) on "data wrangling": the process of collecting, reorganising, and cleaning data.

This heavy toll is due to what is referred to as the four V's of big data: Volume - the scale of the data, Velocity - speed of change, Variety - different forms of data, and Veracity - uncertainty of data. There is an urgent need to provide data scientists with a new generation of tools that will unlock the potential of data assets and significantly reduce the data wrangling component. As many traditional tools are no longer applicable in the 4 V's environment, a radical paradigm shift is required. The proposal aims to achieve this paradigm shift by adding value to data, by handling data management tasks in an environment that is fully aware of data and user contexts, and by closely integrating key data management tasks in a way not yet attempted, but desperately needed by many innovative companies in today's data-driven economy.

The VADA research programme will define principles and solutions for Value Added Data Systems, which support users in discovering, extracting, integrating, accessing and interpreting the data of relevance to their questions. In so doing, it uses the context of the user, e.g., requirements in terms of the trade-off between completeness and correctness, and the data context, e.g., its availability, cost, provenance and quality. The user context characterises not only what data is relevant, but also the properties it must exhibit to be fit for purpose. Adding value to data then involves the best effort provision of data to users, along with comprehensive information on the quality and origin of the data provided. Users can provide feedback on the results obtained, enabling changes to all data management tasks, and thus a continuous improvement in the user experience.

Establishing the principles behind Value Added Data Systems requires a revolutionary approach to data management, informed by interlinked research in data extraction, data integration, data quality, provenance, query answering, and reasoning. This will enable each of these areas to benefit from synergies with the others. Research has developed focused results within such sub-disciplines; VADA develops these specialisms in ways that both transform the techniques within the sub-disciplines and enable the development of architectures that bring them together to add value to data.

The commercial importance of the research area has been widely recognised. The VADA programme brings together university researchers with commercial partners who are in desperate need of a new generation of data management tools. They will be contributing to the programme by funding research staff and students, providing substantial amounts of staff time for research collaborations, supporting internships, hosting visitors, contributing challenging real-life case studies, sharing experiences, and participating in technical meetings. These partners are both developers of data management technologies (LogicBlox, Microsoft, Neo) and data user organisations in healthcare (The Christie), e-commerce (LambdaTek, PricePanda), finance (AllianceBernstein), social networks (Facebook), security (Horus), smart cities (FutureEverything), and telecommunications (Huawei).

Planned Impact

The economic impact of relevant activities is difficult to approximate, but the value of the sub-areas of Big Data, Data Integration and Data Quality is forecast to be over $50B by 2017:
- The International Institute of Analytics estimate the Big Data market at $16.1B in 2014, growing 6 times faster than the overall IT market. Projection for 2017 is ~$50B.
- Gartner (2014) estimates the Data Integration tool market at over $2.2B at end 2013, an increase of 9.4% from 2012. The growth rate is above average for the enterprise software market. By 2018 total revenue should be ~$3.6B.
- Gartner (2014) estimates the Data Quality market as $960M in software revenue at end 2012 ($2B by 2017), an increase of 12.3% from 2011.
Thus directly associated markets - with users across government, industry, health and commerce - are large and fast growing.
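
As a rough cross-check of the figures quoted above, the annual growth rate implied by each pair of estimates can be worked out with a few lines of Python. This is only an illustrative sketch: the function name `implied_cagr` and the dictionary layout are assumptions made for this example, and the calculation treats each pair of figures as exact point estimates with constant compound growth between them, which the cited forecasts do not claim.

```python
def implied_cagr(start_value, end_value, years):
    """Compound annual growth rate implied by two point estimates."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures in billions of US dollars, taken from the list above.
# Each entry is (starting estimate, projected estimate, years between them).
markets = {
    "Big Data (2014 -> 2017)": (16.1, 50.0, 3),
    "Data Integration (2013 -> 2018)": (2.2, 3.6, 5),
    "Data Quality (2012 -> 2017)": (0.96, 2.0, 5),
}

rates = {name: implied_cagr(*figures) for name, figures in markets.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1%} per year")
```

Under these assumptions the implied rates come out at roughly 46%, 10% and 16% per year respectively, which is in line with the description of these markets as fast growing.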

Who will benefit from this research?

Data is central to the efficient operation of many technology development and user organisations, and is the raison d'être for many others. Here we categorise potential VADA beneficiaries into:
1. Technology providers of platforms and solutions for collecting, integrating, and aggregating data. Partner examples include LogicBlox, Microsoft, Neo. New business opportunities are likely to emerge, where impact results from the development of techniques to enable more efficient and effective use of available data.
2. Organisations having a need for such platforms. This is almost every organisation; our partners include knowledge companies who work with product (LambdaTek, PricePanda), financial (AllianceBernstein), security (Horus), social networking (Facebook), telecommunications (Huawei), governmental (FutureEverything) and healthcare (Christie) data.

All partners have highlighted the importance of this research in their support letters:
* VADA addresses fundamental questions that have great significance (Microsoft),
* The challenge addressed by VADA is a significant one (LogicBlox),
* VADA tackles several problems that are of great interest (Facebook),
* We need an automatic approach to reliable, timely and continuous collection and evaluation of sources against an ever-increasing amount of raw data. Current data collection technologies are neither reliable nor scalable enough. (Horus)
* To remain competitive we need to enrich our product data with extended background data. No technology that currently exists can do this. (LambdaTek).

How might they benefit from this research?

VADA's impact is in line with the RCUK priorities:
1. Contribute toward wealth creation and economic prosperity. VADA will develop techniques and methodologies informing the development of platforms to add value to data. Among the many mechanisms that can realise this, we propose a consultancy spin-out. We believe that this will ease the efficient transfer of knowledge from academia to UK industry, as previously demonstrated by similar successful ventures.
2. Shape/enhance effectiveness of public services. The UK has signed up to the Open Government Declaration, which should make travel easier and healthcare better, and create significant growth for UK industry (http://www.cabinetoffice.gov.uk/news/open-data-measures-autumn-statement). However, exploiting such data involves inter-relating it with other data sources, managing variety and veracity. SMEs such as FutureEverything will benefit from efficient techniques for adding value to such data.
3. Enhance training capacity, knowledge and skills of businesses and organisations. Within many organisations, efficient sharing and use of data is crucial for decision-making. VADA will directly train 11 PhD students, supporting exchange visits, workshops, and a summer school. VADA's academics will also be involved in the design of training courses on Value Added Data Systems for the next generation of higher education post-graduate programmes and skill training courses for industry.

Publications

Abel E (2018) SOURCERY

Abel E (2020) Targeted evidence collection for uncertain supplier selection in Expert Systems with Applications

Abel E (2018) User driven multi-criteria source selection in Information Sciences

Alshukaili D (2016) The Semantic Web - ISWC 2016

Amendola G (2018) Explainable Certain Answers

Arenas M (2018) Expressive Languages for Querying the Semantic Web in ACM Transactions on Database Systems

Arenas M. (2016) A datalog-based language for querying RDF graphs? in CEUR Workshop Proceedings

 
Description We have provided a complete spectrum of steps required for data science activities in the context of Vadalog, including 1) Data Integration and Pre-processing, 2) Statistical Analysis, 3) Machine Learning, 4) Algorithmic Modelling, and 5) Probabilistic Reasoning. Each of these parts has seen significant improvements, including the involvement of further Machine Learning approaches. We have extended the research activities around Vadalog to involve further machine learning techniques, through extensive studies on Neural Networks and Knowledge Graph Embedding Models. Recent work in these areas has shown the importance of logical rules. As the core of Vadalog is rule-based reasoning, we launched related work on injecting logical rules into such ML-based approaches. As Knowledge Graphs are characterised by uncertainty in the form of noisy, missing, and incorrect data, we investigated the effect of noise in the presence of logical rules. We show that introducing a new loss function that is both pattern-aware and noise-resilient can solve significant performance issues.
Exploitation Route 1. The newly developed and researched machine learning-based models can significantly improve the results of rule mining and reasoning processes, besides improving the execution efficiency of the Vadalog system.
2. Embedding models are specifically designed for link prediction tasks, and this characteristic can be used to make the more complex steps of logic-based reasoning more efficient.
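
To illustrate what link prediction with an embedding model looks like in practice, the following is a minimal sketch using a TransE-style translation score. TransE is chosen here purely as a well-known example; the record does not state which embedding models were studied in VADA. The entity names, relation name, and two-dimensional vectors are hypothetical values picked by hand, not learned embeddings or project data.

```python
import math

# Hypothetical 2-dimensional embeddings, hand-picked for illustration only.
entity_vectors = {
    "oxford": (0.0, 0.0),
    "uk": (1.0, 0.0),
    "edinburgh": (0.0, 1.0),
    "sweden": (3.0, 3.0),
}
relation_vectors = {"located_in": (1.0, 0.0)}


def transe_distance(head, relation, tail):
    """TransE-style score: distance between (head + relation) and tail.

    A lower distance means the triple is considered more plausible.
    """
    h = entity_vectors[head]
    r = relation_vectors[relation]
    t = entity_vectors[tail]
    translated = [hi + ri for hi, ri in zip(h, r)]
    return math.dist(translated, t)


def rank_tails(head, relation):
    """Rank every other entity as a candidate tail, most plausible first."""
    candidates = [e for e in entity_vectors if e != head]
    return sorted(candidates, key=lambda tail: transe_distance(head, relation, tail))


ranking = rank_tails("oxford", "located_in")
print(ranking)
```

In a reasoning setting, a ranking of this kind could be used to propose candidate facts before a more expensive logic-based check, which is one reading of the efficiency gain described above.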
Sectors Aerospace, Defence and Marine; Digital/Communication/Information Technologies (including Software); Education; Retail; Transport

URL http://vada.org.uk/
 
Description 1. The findings have been used in DBLP to improve their web data extraction approach. In particular, the enhanced version of OXPath is able to extract data from complex web applications, which was not possible earlier.
2. Other non-academic/company collaborations have been conducted. Results are subject to further progress on these collaborations.
3. After becoming familiar with VADA-supported work on mappings of property graphs and on formal semantics of query languages, Neo Technology funded a follow-up project on providing formal underpinnings and semantics of the Cypher query language of their Neo4j graph database system.
4. A series of meetings has been held with BAE Systems. Significant human effort is currently needed to manually process and integrate data, which can delay effective decision making. As a result, there is interest in the combination of decision support and data integration that is being explored in VADA, and we hope this will lead to a research collaboration in due course.
5. The impact of machine learning approaches and embedding models has been examined with a use case in the financial domain, and this work is planned to be extended to industry scale.
First Year Of Impact 2019
Sector Aerospace, Defence and Marine; Digital/Communication/Information Technologies (including Software); Education; Financial Services, and Management Consultancy
Impact Types Societal, Economic, Policy & public services

 
Description Amazon Web Services Research Credits
Amount $10,760 (USD)
Organisation Amazon.com 
Sector Private
Country United States
Start 09/2018 
End 09/2019
 
Description EPSRC Industrial CASE Doctoral Studentship: AI and Cognitive Computing for Reasoning about Big Data with Application to the Oil and Gas Industry
Amount £200,000 (GBP)
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 10/2017 
End 04/2021
 
Description Innovate UK Internet of Things Cities Demonstrator
Amount £856,996 (GBP)
Organisation Innovate UK 
Sector Public
Country United Kingdom
Start 07/2016 
End 06/2018
 
Description LAMBDA: Learning, Applying, Multiplying Big Data Analytics
Amount £168,835 (GBP)
Funding ID GA No. 809965 
Organisation European Commission 
Sector Public
Country European Union (EU)
Start 07/2018 
End 12/2020
 
Description Neo Technology - industry funding
Amount £150,000 (GBP)
Organisation Neo4j 
Sector Private
Country United States
Start 01/2017 
End 09/2018
 
Description Ratiolytics: a rule-based AI system for reasoning, data wrangling and analytics
Amount £44,123 (GBP)
Funding ID EP/R511742/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 07/2017 
End 04/2018
 
Description Neo technology (Neo4j) 
Organisation Neo4j
Country United States 
Sector Private 
PI Contribution We established a joint research project with a leading vendor of graph databases, Neo Technology (based in the UK and Sweden; the name of their product is Neo4j). Our main goal was to produce a formal semantics of their query language Cypher and to make suggestions about further development of the language.
Collaborator Contribution In addition to committing a significant amount of core staff time, Neo also funded a postdoc position in Edinburgh.
Impact Full formal semantics of the core language has been developed. A paper describing it will appear at SIGMOD 2018 in June.
Start Year 2017
 
Title JupyterLab environment for Vadalog execution 
Description We have adapted the JupyterLab data science environment to execute Vadalog, the reasoning language developed in the context of the VADA project. Features include: (1) Rule authoring and execution, (2) Interaction with Python and R, (3) Program analysis and debugging, (4) Model Explanations (Proof Trees and Audit Trails), (5) Visualisation 
Type Of Technology Webtool/Application 
Year Produced 2018 
Impact This is a main contribution towards adoption of Vadalog in the wider community, as Jupyter has recently become a popular tool in the data science and research community. We can expect researchers to use Vadalog for specific purposes alongside other tools in their toolchain, without having to adapt to a new environment. 
 
Company Name DeepReason.ai Ltd 
Description DeepReason.ai offers a KGMS that leverages recent breakthroughs in logical reasoning and database theory developed at the University of Oxford. 
Year Established 2018 
Impact The company has just been created.
Website http://deepreason.ai/
 
Company Name THE DATA VALUE FACTORY LIMITED 
Description The company provides software and services for data preparation. 
Year Established 2018 
Impact The company provides software and services for data preparation.
Website http://thedatavaluefactory.com/
 
Description Data Wrangling for Big Data 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact This workshop included presentations, demonstrations and posters on work that relates to big data wrangling, with a view to sharing best practice and emerging techniques. The presentations included the ongoing work from the VADA partners as well as presentations from industrial experts. The event was intended primarily for data scientists and computer scientists from business/industry and academia. It sparked further discussions about the field of data wrangling, extraction, cleaning and reasoning, i.e., the subjects of the VADA project.
Year(s) Of Engagement Activity 2017
URL https://www.turing.ac.uk/events/data-wrangling-big-data/
 
Description Invited Speaker at International Conference on Model and Data Engineering 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The Eighth International Conference on Model & Data Engineering (MEDI) will be held from 24 to 26 October 2018 in Marrakesh, Morocco. Its main objective is to provide a forum for the dissemination of research accomplishments and to promote the interaction and collaboration between the models and data research communities. MEDI'2018 provides an international platform for the presentation of research on models and data theory, development of advanced technologies related to models and data and their advanced applications. This international scientific event, initiated by researchers from Euro-Mediterranean countries, also aims at promoting the creation of north-south scientific networks, projects and faculty/student exchanges.
Year(s) Of Engagement Activity 2018
URL https://easychair.org/cfp/MEDI2018
 
Description Invited Speaker at International European Conference on Logics in Artificial Intelligence 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The conference is about logic in AI, and I gave a talk about Vadalog, which is a logic-based reasoning language for modern AI applications, in particular for knowledge graph systems. I presented recent advances and applications, with a focus on the language Vadalog itself.
Year(s) Of Engagement Activity 2019
URL https://jelia2019.mat.unical.it/invited-speakers#h.p_446ZRly1aeB1
 
Description Invited Talk - Lovelace Lecture And Conferment Of The Lovelace Medal 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Industry/Business
Results and Impact 22 March 2018 Lovelace Lecture And Conferment Of The Lovelace Medal to Prof. Gottlob
Year(s) Of Engagement Activity 2018
URL https://www.bcs.org/category/19248
 
Description Invited plenary speaker at International Conference On Scalable Uncertainty Management 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Keynote talk: Swift Logic for Big Data and Knowledge Graphs
Year(s) Of Engagement Activity 2018
URL http://www.ir.disco.unimib.it/sum2018/invited-speakers/
 
Description Keynote at Austrian Computer Science Day 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact I talked about my adventures with Datalog, walking the thin line between theory and practice.
Year(s) Of Engagement Activity 2017,2019
URL https://acsd2019.ai.wu.ac.at/timetable/event/georg-gottlob/
 
Description Milner Lecture, at Edinburgh University 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Postgraduate students
Results and Impact Milner Lecture 2018: Swift Logic for Big Data and Knowledge Graphs
Year(s) Of Engagement Activity 2018
URL http://wcms.inf.ed.ac.uk/lfcs/events/swift-logic-for-big-data-and-knowledge-graphs
 
Description The EDBT Summer School 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact The EDBT Summer School brings together leading researchers recognised as experts in their fields and provides participants the opportunity to gain deeper insight into current research trends in the database area. In 2017, the theme of the school will be "Adding Value to Data". The scientific topics will cover principles and solutions for adding value to data, that is, for supporting users in discovering, extracting, integrating, accessing and interpreting the data of relevance to their questions, while taking into account the role of the crowd and the impact of dirty data as well as the adoption of responsible data management and analysis processes. The school will be organised around 7 main themes.
The 2017 summer school follows the successful structure of previous EDBT schools: stimulating lectures by leading researchers in the field (two for each main theme), group work on assignments, and a lively scientific and social program.
Year(s) Of Engagement Activity 2017
 
Description invited plenary speaker at RuleML+RR 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Invited plenary speaker at RuleML+RR
Year(s) Of Engagement Activity 2018