AnaLOG: Datalog Extensions for the Analysis of Static and Streaming Data (Ext.)

Lead Research Organisation: University of Oxford
Department Name: Computer Science

Abstract

This is an extension of the fellowship "MaSI3: A Massively Scalable Intelligent Information Infrastructure" (EP/K00607X/1).

Intelligent data management techniques play a key role in areas such as healthcare, business, and government. Healthcare providers such as Kaiser Permanente use such techniques for auditing; companies such as RailComplete use them to verify transport infrastructure designs; and oil producers such as StatOil analyse streaming sensor data to diagnose faults and prevent failures. To simplify the management of the data, intelligent information systems (IISs) provide services that (i) capture the semantics of the data using background knowledge about the application domain, and (ii) use reasoning to infer information implicit in the data and the background knowledge. The vision of the MaSI3 fellowship is to make intelligent information systems a reality by developing scalable reasoning and query answering techniques. The techniques developed in the project to date provide the foundation for a new IIS called RDFox. The system is based on datalog -- a language that models background knowledge of IISs using 'if-then' rules.

Extensive engagement with users within MaSI3 has revealed great potential of RDFox for data analysis -- an exciting new application of IISs. The term 'data analysis' covers a broad range of techniques, which can include searching for patterns and predicting future behaviour using statistical and machine learning algorithms. In many cases, however, data analysis involves the use of data manipulation tasks that aggregate data, verify properties, or answer queries. Such tasks are typically solved imperatively (e.g., using languages such as Java or Scala) by specifying how to manipulate the data, which is undesirable because the objective of the analysis is often obscured by evaluation concerns. It has been argued that data analysis should be declarative: users should describe what the desired output is, rather than how to compute it. For example, instead of computing shortest paths in a graph using a concrete algorithm, one should (i) describe what a path length is, and (ii) state that only paths of minimum length are needed. Such a specification is independent of evaluation details, which allows users to focus on the task at hand. An evaluation strategy can be chosen later to satisfy specific requirements; for example, parallelisation or incremental techniques can be reused 'for free' to speed up computation or update the output after applying a change to the input.
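
To make the shortest-paths example concrete, the following is a minimal Python sketch of how the two-part specification could be evaluated naively. It is an illustration only: it is not RDFox code and not an output of the project, and the function name shortest_paths, the representation of edges as triples, and the restriction to non-negative weights are all assumptions made for this example.

```python
def shortest_paths(edges):
    """Return {(source, target): minimum path length}.

    `edges` is an iterable of (source, target, weight) triples.
    Weights are assumed to be non-negative (an assumption of this sketch).
    """
    edges = list(edges)
    inf = float("inf")

    # Part (i), base case: every edge is a path whose length is its weight.
    dist = {}
    for x, y, w in edges:
        if w < dist.get((x, y), inf):
            dist[(x, y)] = w

    # Part (i), recursive case: a path extended by an edge is again a path.
    # Part (ii): of all paths between two nodes, keep only the minimum length.
    # The loop repeats until no rule application improves any length.
    changed = True
    while changed:
        changed = False
        for (x, y), d in list(dist.items()):
            for y2, z, w in edges:
                if y2 == y and d + w < dist.get((x, z), inf):
                    dist[(x, z)] = d + w
                    changed = True
    return dist


example_edges = [("a", "b", 1), ("b", "c", 2), ("a", "c", 5)]
result = shortest_paths(example_edges)
```

For these three edges the sketch records a length of 3 for the pair (a, c), because the route through b is shorter than the direct edge. The loop is only one possible evaluation strategy; the two rules in the comments say nothing about evaluation order, which is the independence the paragraph above describes.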

A key problem on the path to declarative data analysis is to design a language that can express the relevant tasks. Datalog has been identified as a natural starting point: its expressivity and complexity are well understood, and it is already used to declaratively capture domain knowledge in IISs. My work in the MaSI3 fellowship confirms the potential of datalog, but it has also revealed the inability of datalog to express several natural and common data analysis problems. For example, datalog cannot answer the bill of materials query (i.e., count the occurrences of a part in a hierarchical product structure); moreover, basic datalog cannot express the shortest paths problem, and datalog extensions that can express this problem are inefficient when used with known reasoning algorithms. Furthermore, there are challenges in using datalog in a streaming setting (i.e., where data is produced continuously).
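
As an illustration of what the bill of materials query computes, here is a small Python sketch. It is written imperatively only to show the arithmetic involved (multiplying quantities along the hierarchy and summing over components), which is the kind of aggregation the paragraph above says plain datalog cannot express. The function name count_occurrences, the dictionary layout, and the sample product data are hypothetical, and the product structure is assumed to be acyclic.

```python
from functools import lru_cache


def count_occurrences(structure, product, part):
    """Count how many times `part` occurs in `product`.

    `structure` maps an assembly to a list of (component, quantity) pairs.
    The hierarchy is assumed to be acyclic (an assumption of this sketch).
    """

    @lru_cache(maxsize=None)
    def count(item):
        # The part itself counts once.
        if item == part:
            return 1
        # Otherwise sum the counts of the components, scaled by quantity.
        # Items without components contribute zero.
        return sum(qty * count(child) for child, qty in structure.get(item, ()))

    return count(product)


# Hypothetical product structure used only for this example.
bike = {
    "bike": [("wheel", 2), ("frame", 1)],
    "wheel": [("spoke", 32), ("tyre", 1)],
    "frame": [("bolt", 4)],
}
spokes = count_occurrences(bike, "bike", "spoke")
bolts = count_occurrences(bike, "bike", "bolt")
```

With this data, a bike contains 64 spokes (2 wheels with 32 spokes each) and 4 bolts. The memoisation through lru_cache reuses the count of each sub-assembly, which is a simple instance of the dynamic programming methods mentioned in the objectives.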

Thus, the objective of this fellowship extension is to develop datalog extensions for data analysis in IISs, establish links with known problem solving methods (e.g., dynamic programming), and evaluate the results with my collaborators. My main research problems are about language design, and are thus to an extent independent of the specific evaluation methods. I will validate my results by implementing them in RDFox, but they can also be implemented in Big Data frameworks such as Hadoop and Spark.

Planned Impact

* Beneficiaries

Industry increasingly depends on producing and extracting value from data, and intelligent information systems (IISs) aim to improve the tasks surrounding data storage, management, and exploitation. The vision of the MaSI3 fellowship (EP/K00607X/1) is to make intelligent information systems a reality by developing scalable reasoning and query answering techniques. The research conducted in the fellowship thus far has led to the development of a new IIS called RDFox, which has attracted considerable interest from industry. The goal of this project is to extend the languages and techniques in IISs to support declarative data analysis.

Data analysis underpins human activity in areas as diverse as academic research, business, healthcare, and governance. Since most of us engage in these activities on a daily basis, the research conducted in this project will benefit the wider society in the long run. In the shorter term, beneficiaries will include users and developers of IISs in academia and industry, and the Case for Support document presents three example use cases that showcase the practical applicability of the envisaged technical results.

* Ongoing Commercialisation Activities

As mentioned in the Case for Support document, the University of Oxford and I are currently working on spinning out two companies that will commercialise the results of the MaSI3 fellowship.

The first spinout company, called Covatic, aims to develop a technical infrastructure for personalised media broadcasting, where programme schedules are generated (semi)automatically based on users' preferences and editorial policies. RDFox will be used for storing the metadata associated with media assets, and its reasoning capabilities will be used to interpret editorial policies and users' preferences to derive schedules.

The second spinout company, called Fox Data, will focus on integration of enterprise data -- a problem that is highly relevant to numerous (national and international) SMEs who cannot afford the cost and inflexibility of an integrated Enterprise Resource Planning solution. While the company will initially focus on homogenising the structural aspects of enterprise data, analysing the integrated data is the next logical step that will considerably increase the return on investment of data integration. Thus, extending data integration with data analysis would be of great benefit to Fox Data customers, allowing them to manage and exploit their data more effectively. Consequently, Fox Data will provide an ideal exploitation pathway for the results of this project.

In both cases, the UK economy will benefit through creation of new highly skilled jobs. Moreover, the University of Oxford will benefit through license fee revenue, and it will also hold a stake in both companies, which has the potential of generating considerable income.

* Dissemination and Engagement

A range of activities will ensure the widest possible dissemination of this project's results and engagement with anticipated beneficiaries. First, I will continue publishing the results of my research in top journals and conferences, with a specific goal of broadening the range of venues. Second, close collaboration with my project partners (see Letters of Support) will provide me with practical use cases, and joint evaluation based on concrete use cases will provide another important exploitation pathway for this project. Third, I will continue collaborating with the developers of IISs in both academia and industry; for example, Oracle are already funding my research via an unrestricted grant of £55k and have expressed great interest in the results of this project. Fourth, I will continue to make all project outputs available from the project web site, including papers, presentations, tutorials, and software.

Publications

 
Description: The main findings thus far involve optimised algorithms for Datalog reasoning as described in the publications presented at the ISWC 2019, AAAI 2019, and CIKM 2019 conferences.
Exploitation Route: These algorithms are of interest to the developers of commercial data management systems, such as the Oxford Semantic Technologies startup.
Sectors: Digital/Communication/Information Technologies (including Software)

 
Description: This project is an extension of the MaSI3 fellowship, and as such its impact extends the impact of that fellowship. In particular, a significant focus of this project was the further development of techniques for Datalog reasoning. These techniques are finding their way into the RDFox semantic management system, which is currently being developed in the Oxford Semantic Technologies spinout. Thus, the impact consists of the development of efficient Datalog reasoning algorithms, which are subsequently taken up by a commercial organisation.
First Year Of Impact: 2019
Sector: Digital/Communication/Information Technologies (including Software)
Impact Types: Economic

 
Title: RDFox
Description: Triple store / graph DB
Type Of Technology: Software
Year Produced: 2016
Impact: Basis for Covatic and OST spin-outs
URL: https://www.cs.ox.ac.uk/isg/tools/RDFox/
 
Company Name: Oxford Semantic Technologies Ltd
Description: The company aims to convert RDFox -- a major output of the MaSI3 fellowship -- into a commercial system that can power various enterprise applications in areas as diverse as information integration, compliance reporting, and metadata management. The company is exploiting the IP created in the patent GB1319252.1, which is also listed as an outcome of the MaSI3 fellowship.
Year Established: 2017
Impact: The company has only recently started, so it has not yet had a major impact.
Website: http://oxfordsemantic.tech