VADA: Value Added Data Systems -- Principles and Architecture

Lead Research Organisation: University of Oxford
Department Name: Computer Science

Abstract

Data is everywhere, generated by increasing numbers of applications, devices and users, with few or no guarantees on the format, semantics, and quality. The economic potential of data-driven innovation is enormous, estimated by the Centre for Economics and Business Research to reach as much as £40B in 2017. To realise this potential, and to provide meaningful data analyses, data scientists must first spend a significant portion of their time (estimated as 50% to 80%) on "data wrangling" - the process of collecting, reorganising, and cleaning data.

This heavy toll is due to what is referred to as the four V's of big data: Volume - the scale of the data, Velocity - speed of change, Variety - different forms of data, and Veracity - uncertainty of data. There is an urgent need to provide data scientists with a new generation of tools that will unlock the potential of data assets and significantly reduce the data wrangling component. As many traditional tools are no longer applicable in the 4 V's environment, a radical paradigm shift is required. The proposal aims at achieving this paradigm shift by adding value to data, by handling data management tasks in an environment that is fully aware of data and user contexts, and by closely integrating key data management tasks in a way not yet attempted, but desperately needed by many innovative companies in today's data-driven economy.

The VADA research programme will define principles and solutions for Value Added Data Systems, which support users in discovering, extracting, integrating, accessing and interpreting the data of relevance to their questions. In so doing, it uses the context of the user, e.g., requirements in terms of the trade-off between completeness and correctness, and the data context, e.g., its availability, cost, provenance and quality. The user context characterises not only what data is relevant, but also the properties it must exhibit to be fit for purpose. Adding value to data then involves the best-effort provision of data to users, along with comprehensive information on the quality and origin of the data provided. Users can provide feedback on the results obtained, enabling changes to all data management tasks, and thus a continuous improvement in the user experience.

Establishing the principles behind Value Added Data Systems requires a revolutionary approach to data management, informed by interlinked research in data extraction, data integration, data quality, provenance, query answering, and reasoning. This will enable each of these areas to benefit from synergies with the others. Research has developed focused results within such sub-disciplines; VADA develops these specialisms in ways that both transform the techniques within the sub-disciplines and enable the development of architectures that bring them together to add value to data.

The commercial importance of the research area has been widely recognised. The VADA programme brings together university researchers with commercial partners who are in desperate need of a new generation of data management tools. They will be contributing to the programme by funding research staff and students, providing substantial amounts of staff time for research collaborations, supporting internships, hosting visitors, contributing challenging real-life case studies, sharing experiences, and participating in technical meetings. These partners are both developers of data management technologies (LogicBlox, Microsoft, Neo) and data user organisations in healthcare (The Christie), e-commerce (LambdaTek, PricePanda), finance (AllianceBernstein), social networks (Facebook), security (Horus), smart cities (FutureEverything), and telecommunications (Huawei).

Planned Impact

The economic impact of relevant activities is difficult to approximate, but the value of the sub-areas of Big Data, Data Integration and Data Quality is forecast to be over $50B by 2017:
- The International Institute of Analytics estimate the Big Data market at $16.1B in 2014, growing 6 times faster than the overall IT market. Projection for 2017 is ~$50B.
- Gartner (2014) estimates the Data Integration tool market at over $2.2B at end 2013, an increase of 9.4% from 2012. Growth rate is above average for the enterprise software market. By 2018 total revenue should be ~$3.6B.
- Gartner (2014) estimates the Data Quality market as $960M in software revenue at end 2012 ($2B by 2017), an increase of 12.3% from 2011.
Thus directly associated markets - with users across government, industry, health and commerce - are large and fast growing.

Who will benefit from this research?

Data is central to the efficient operation of many technology development and user organisations, and is the raison d'etre for many others. Here we categorise potential VADA beneficiaries into:
1. Technology providers of platforms and solutions for collecting, integrating, and aggregating data. Partner examples include LogicBlox, Microsoft, Neo. New business opportunities are likely to emerge, where impact results from the development of techniques to enable more efficient and effective use of available data.
2. Organisations having a need for such platforms. This is almost every organization; our partners include knowledge companies who work with product (LambdaTek, PricePanda), financial (AllianceBernstein), security (Horus), social networking (Facebook), telecommunications (Huawei), governmental (FutureEverything) and healthcare (Christie) data.

All partners have highlighted the importance of this research in their support letters:
* VADA addresses fundamental questions that have great significance (Microsoft),
* The challenge addressed by VADA is a significant one (LogicBlox),
* VADA tackles several problems that are of great interest (Facebook),
* We need an automatic approach to reliable, timely and continuous collection and evaluation of sources against an ever-increasing amount of raw data. Current data collection technologies are neither reliable nor scalable enough. (Horus)
* To remain competitive we need to enrich our product data with extended background data. No technology that currently exists can do this. (LambdaTek).

How might they benefit from this research?

VADA's impact is in line with the RCUK priorities:
1. Contribute toward wealth creation and economic prosperity. VADA will develop techniques and methodologies informing the development of platforms to add value to data. Among the many mechanisms that can realise this, we propose a consultancy spin-out. We believe that this will ease the efficient transfer of knowledge from academia to UK industry, as previously demonstrated by similar successful ventures.
2. Shape/enhance effectiveness of public services. The UK has signed up to the Open Government Declaration, which should make travel easier and healthcare better, and create significant growth for UK industry (http://www.cabinetoffice.gov.uk/news/open-data-measures-autumn-statement). However, exploiting such data involves inter-relating it with other data sources, managing variety and veracity. SMEs such as FutureEverything will benefit from efficient techniques for adding value to such data.
3. Enhance training capacity, knowledge and skills of businesses and organisations. Within many organisations, efficient sharing and use of data is crucial for decision-making. VADA will directly train 11 PhD students, supporting exchange visits, workshops, and a summer school. VADA's academics will also be involved in the design of training courses on Value Added Data Systems for the next generation of higher education post-graduate programmes and skill training courses for industry.

Publications

Abel E (2018) SOURCERY
Abel E (2018) User driven multi-criteria source selection in Information Sciences
Alshukaili D (2016) The Semantic Web - ISWC 2016
Amendola G (2018) Explainable Certain Answers
Arenas M (2018) Expressive Languages for Querying the Semantic Web in ACM Transactions on Database Systems
Arenas M. (2016) A datalog-based language for querying RDF graphs? in CEUR Workshop Proceedings
Barcelo P (2016) Order-Invariant Types and Their Applications in Logical Methods in Computer Science

 
Description 1. We have further developed the Vadalog language, which aims to describe various complex aspects of big data processing and knowledge representation in the form of concise, intuitive, human-readable declarative instructions. Vadalog, a subset of Datalog+/-, is the core language of the overall data wrangling process and is now under active development.
2. We have been developing the main Vadalog engine, which executes Vadalog programs. The first prototype is a proof of concept and was used as an evaluation framework for performance testing and comparative analysis of different possible implementations and technologies. Several prototype data wrangling modules were implemented on this framework, e.g., an OXPath-based transducer for data extraction (developed at the University of Oxford) and a schema mapping transducer (developed at the University of Manchester). The second prototype builds on the lessons learned and implements its own interpreter, able to process large volumes of data in streaming mode.
3. Web Data Extraction is another important aspect of the web data wrangling methodology. Within the scope of the VADA project we have considerably improved the OXPath approach to crawling complex web applications and extracting relevant data, transforming it into the different structured formats required by third-party applications. The improvements concern more robust simulation of user interactions as well as a comprehensive transformation of extracted data for data consumers, such as Vada Transducers.
4. Developed the first formal semantics of an expressive fragment of SQL, the main query language of commercial relational DBMSs, and fully explained the use of many-valued logics in query evaluation.
5. Classified the complexity of key tasks associated with mappings of graph data, in graph data integration and exchange.
6. Developed scalable parallel algorithms for evaluating queries on very large graph databases.
7. Architectures for Data Wrangling: The VADA architecture provides end-to-end data wrangling, with dynamic orchestration of components that are sensitive to the data context, and results produced that are sensitive to the user context.
8. Multi-criteria Source Selection. Multi-criteria decision support techniques have been adapted and extended to support the selection of data sources that best meet user-specified criteria. In addition, techniques have been developed that target feedback from users or crowd workers to cost-effectively remove the risks that arise from uncertainty in criteria values.
9. Developed an integration with JupyterLab, the popular data science and research platform. This integration allows users to execute Vadalog programs and to visualise their results and explanations. This is a major step towards fostering stronger adoption of Vadalog in the wider research community.
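As an illustration of item 8, the following is a minimal sketch of multi-criteria source selection. It assumes a plain weighted sum over criterion values already scaled to [0, 1], which is only one simple decision-support technique and is not necessarily the method developed in VADA. The `Source` class, the criterion names, the function names and the example data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """A candidate data source with hypothetical criterion values in [0, 1]."""
    name: str
    completeness: float
    accuracy: float
    freshness: float


def score(source, weights):
    """Weighted sum of the criterion values; weights express user priorities."""
    return sum(w * getattr(source, criterion) for criterion, w in weights.items())


def select_sources(sources, weights, k):
    """Return the k sources with the highest weighted score, best first."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalise the weights so that scores stay within [0, 1].
    normalised = {c: w / total for c, w in weights.items()}
    return sorted(sources, key=lambda s: score(s, normalised), reverse=True)[:k]


# Invented example data, for illustration only.
SOURCES = [
    Source("catalogue_a", completeness=0.9, accuracy=0.6, freshness=0.5),
    Source("catalogue_b", completeness=0.5, accuracy=0.9, freshness=0.8),
    Source("catalogue_c", completeness=0.7, accuracy=0.7, freshness=0.2),
]

# A user who values accuracy twice as much as the other criteria.
USER_WEIGHTS = {"completeness": 1.0, "accuracy": 2.0, "freshness": 1.0}

if __name__ == "__main__":
    for s in select_sources(SOURCES, USER_WEIGHTS, k=2):
        print(s.name)
```

Changing the weights changes the ranking, which is how the user context enters the selection in this sketch. Handling uncertain criterion values through targeted user or crowd feedback, as described in item 8, is outside its scope.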
Exploitation Route 1. Our work on the Vadalog language and on the Vadalog engine for the efficient execution of Vadalog programs can potentially address various problems related to big data processing and replace traditional labour-intensive ETL processes. Our technology can be applied in sectors that rely on Big Data technology and the Internet of Things (IoT), for instance the large industry and transportation sectors. Furthermore, our findings can be used in the XaaS paradigm to process efficiently the large amounts of data transferred between different services.
2. The proposed web data extraction methodology has several advantages that can be exploited in different sectors. 1) It is a major step towards a semantic web, that is, a Web that is accessible to automatic processing and analysis. Data enriched with a semantic description can considerably improve approaches to Information Search, Competitive Intelligence, and Opinion Mining, to mention a few. 2) Web data identified with our technology can also make the Web more accessible for people with different impairments (physically impaired, dyslexic, hearing impaired, partially sighted, and blind), by adapting and transcoding web applications with identified information blocks into a convenient representation.
Sectors Digital/Communication/Information Technologies (including Software)

URL http://vada.org.uk/
 
Description 1. The findings have been used in DBLP to improve their web data extraction approach. In particular, the enhanced version of OXPath is able to extract data from complex web applications, which was not possible earlier.
2. Other non-academic/company collaborations have been conducted. Results are subject to further progress on these collaborations.
3. After becoming familiar with VADA-supported work on mappings of property graphs and on formal semantics of query languages, Neo Technology funded a follow-up project on providing formal underpinnings and semantics of the Cypher query language of their Neo4j graph database system.
4. A series of meetings has been held with BAE Systems. Significant human effort is currently needed to manually process and integrate data, which can delay effective decision making. As a result, there is interest in the combination of decision support and data integration that is being explored in VADA, and we hope this will lead to a research collaboration in due course.
Sector Digital/Communication/Information Technologies (including Software), Financial Services, and Management Consultancy
Impact Types Societal, Economic, Policy & public services

 
Description Amazon Web Services Research Credits
Amount $10,760 (USD)
Organisation Amazon.com 
Sector Private
Country United States
Start 09/2018 
End 09/2019
 
Description EPSRC Industrial CASE Doctoral Studentship: AI and Cognitive Computing for Reasoning about Big Data with Application to the Oil and Gas Industry
Amount £200,000 (GBP)
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Academic/University
Country United Kingdom
Start 10/2017 
End 04/2021
 
Description Innovate UK Internet of Things Cities Demonstrator
Amount £856,996 (GBP)
Organisation Innovate UK 
Sector Public
Country United Kingdom
Start 07/2016 
End 06/2018
 
Description LAMBDA: Learning, Applying, Multiplying Big Data Analytics
Amount £168,835 (GBP)
Funding ID GA No. 809965 
Organisation European Commission 
Sector Public
Country European Union (EU)
Start 07/2018 
End 12/2020
 
Description Neo Technology - industry funding
Amount £150,000 (GBP)
Organisation Neo4j 
Start 01/2017 
End 09/2018
 
Description Ratiolytics: a rule-based AI system for reasoning, data wrangling and analytics
Amount £44,123 (GBP)
Funding ID EP/R511742/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Academic/University
Country United Kingdom
Start 07/2017 
End 04/2018
 
Description Neo technology (Neo4j) 
Organisation Neo4j
PI Contribution We established a joint research project with a leading vendor of graph databases, Neo Technology (based in the UK and Sweden; the name of their product is Neo4j). Our main goal was to produce a formal semantics of their query language Cypher and to make suggestions about further development of the language.
Collaborator Contribution In addition to committing a significant amount of core staff time, Neo also funded a postdoc position in Edinburgh.
Impact Full formal semantics of the core language has been developed. A paper describing it will appear at SIGMOD 2018 in June.
Start Year 2017
 
Title JupyterLab environment for Vadalog execution 
Description We have adapted the JupyterLab data science environment to execute Vadalog, the reasoning language developed in the context of the VADA project. Features include: (1) Rule authoring and execution, (2) Interaction with Python and R, (3) Program analysis and debugging, (4) Model Explanations (Proof Trees and Audit Trails), (5) Visualisation
Type Of Technology Webtool/Application 
Year Produced 2018 
Impact This is a main contribution towards adoption of Vadalog in the wider community, as Jupyter has recently become a popular tool in the data science and research community. We can expect researchers to use Vadalog for specific purposes alongside other tools in their toolchain, without having to adapt to a new environment. 
 
Company Name DeepReason.ai Ltd 
Description DeepReason.ai offers a KGMS that leverages recent breakthroughs in logical reasoning and database theory developed at the University of Oxford. 
Year Established 2018 
Impact The company has just been created.
Website http://deepreason.ai/
 
Company Name THE DATA VALUE FACTORY LIMITED 
Description The company provides software and services for data preparation. 
Year Established 2018 
Impact The company provides software and services for data preparation.
Website http://thedatavaluefactory.com/
 
Description Data Wrangling for Big Data 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact This workshop included presentations, demonstrations and posters on work that relates to big data wrangling, with a view to sharing best practice and emerging techniques. The presentations included ongoing work from the VADA partners as well as presentations from industrial experts. The event was intended primarily for data scientists and computer scientists from business/industry and academia. It sparked further discussions about the field of data wrangling, extraction, cleaning and reasoning, i.e., the subjects of the VADA project.
Year(s) Of Engagement Activity 2017
URL https://www.turing.ac.uk/events/data-wrangling-big-data/
 
Description Invited Speaker at International Conference on Model and Data Engineering 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The Eighth International Conference on Model & Data Engineering (MEDI) will be held from 24 to 26 October 2018 in Marrakesh, Morocco. Its main objective is to provide a forum for the dissemination of research accomplishments and to promote the interaction and collaboration between the models and data research communities. MEDI'2018 provides an international platform for the presentation of research on models and data theory, development of advanced technologies related to models and data and their advanced applications. This international scientific event, initiated by researchers from Euro-Mediterranean countries, aims also at promoting the creation of north-south scientific networks, projects and faculty/student exchanges.
Year(s) Of Engagement Activity 2018
URL https://easychair.org/cfp/MEDI2018
 
Description Invited Talk - Lovelace Lecture And Conferment Of The Lovelace Medal 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Industry/Business
Results and Impact 22 March 2018 Lovelace Lecture And Conferment Of The Lovelace Medal to Prof. Gottlob
Year(s) Of Engagement Activity 2018
URL https://www.bcs.org/category/19248
 
Description Invited plenary speaker at International Conference On Scalable Uncertainty Management 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Keynote talk: Swift Logic for Big Data and Knowledge Graphs
Year(s) Of Engagement Activity 2018
URL http://www.ir.disco.unimib.it/sum2018/invited-speakers/
 
Description Milner Lecture, at Edinburgh University 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Postgraduate students
Results and Impact Milner Lecture 2018: Swift Logic for Big Data and Knowledge Graphs
Year(s) Of Engagement Activity 2018
URL http://wcms.inf.ed.ac.uk/lfcs/events/swift-logic-for-big-data-and-knowledge-graphs
 
Description The EDBT Summer School 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact The EDBT Summer School brings together leading researchers recognised as experts in their fields and provides participants the opportunity to gain deeper insight into current research trends in the database area. In 2017, the theme of the school will be "Adding Value to Data". The scientific topics will cover principles and solutions for adding value to data, that is, for supporting users in discovering, extracting, integrating, accessing and interpreting the data of relevance to their questions, while taking into account the role of the crowd and the impact of dirty data as well as the adoption of responsible data management and analysis processes. The school will be organised around 7 main themes.
The 2017 summer school follows the successful structure of previous EDBT schools: stimulating lectures by leading researchers in the field (two for each main theme), group work on assignments, and a lively scientific and social program.
Year(s) Of Engagement Activity 2017
 
Description Invited plenary speaker at RuleML+RR
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Invited plenary speaker at RuleML+RR
Year(s) Of Engagement Activity 2018