MaSI3: A Massively Scalable Intelligent Information Infrastructure

Lead Research Organisation: University of Oxford
Department Name: Computer Science

Abstract

Ontology-based Data Management Systems (ODMSs) are a new kind of data management system specifically designed to manage the large semi-structured data sets needed to power modern intelligent applications. Most ODMSs are based on the Resource Description Framework (RDF) data model, which was specifically designed for the representation of semi-structured data. RDF data sets consist of triples and are often viewed as graphs with labelled vertices and edges. The structure of RDF data is described using an ontology - a set of logical axioms that give semantics to the graph and enable the derivation of new triples via reasoning. The ontology is often expressed in the Web Ontology Language (OWL), sometimes extended with the Semantic Web Rule Language (SWRL). The main task of an ODMS is to answer queries over the given ontology and data set, with the queries commonly being expressed in the SPARQL language. Reasoning plays a key role in query answering, and modern intelligent applications commonly require an integration of taxonomic, spatio-temporal, mereological, and other kinds of reasoning.
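
As a rough illustration of the ideas in this paragraph, the following minimal Python sketch represents an RDF data set as a set of (subject, predicate, object) triples, treats a single taxonomic axiom pattern as a rule that derives new triples, and answers one simple query over the result. The ex: identifiers and the function names are hypothetical, and the code is a toy under these assumptions rather than a description of how any particular ODMS works.

```python
# Toy RDF graph: each triple is (subject, predicate, object).
# The ex: names are made up for illustration only.
triples = {
    ("ex:oxford", "rdf:type", "ex:University"),
    ("ex:University", "rdfs:subClassOf", "ex:Organisation"),
    ("ex:Organisation", "rdfs:subClassOf", "ex:Agent"),
}


def apply_subclass_rule(graph):
    """Apply one taxonomic rule once:
    (x rdf:type C) and (C rdfs:subClassOf D)  =>  (x rdf:type D)."""
    derived = set()
    for (x, p, c) in graph:
        if p != "rdf:type":
            continue
        for (c2, q, d) in graph:
            if q == "rdfs:subClassOf" and c2 == c:
                derived.add((x, "rdf:type", d))
    return derived


def materialise(graph):
    """Repeat the rule until no new triples are derived (a fixpoint)."""
    result = set(graph)
    while True:
        new = apply_subclass_rule(result) - result
        if not new:
            return result
        result |= new


def instances_of(graph, cls):
    """Answer the query: SELECT ?x WHERE { ?x rdf:type cls }."""
    return sorted(s for (s, p, o) in graph if p == "rdf:type" and o == cls)


explicit_answers = instances_of(triples, "ex:Agent")
reasoned_answers = instances_of(materialise(triples), "ex:Agent")
```

Without reasoning the query for instances of ex:Agent finds nothing in the explicit triples; after the rule has been applied to a fixpoint, ex:oxford is returned because its membership has been derived through the two subclass axioms.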

ODMSs can and do exploit implementation techniques described in the database literature. The computational problems that such systems need to solve, however, are very hard, so developing robustly scalable systems is extremely challenging, usually requiring a combination of heuristics and careful engineering. Although significant progress has been made and state-of-the-art ODMSs can now deal with nontrivial data sets, their performance still falls far short of what is required by modern 'data hungry' applications. This is partly due to the sheer size of the data sets that need to be processed, but also partly due to the complexity of the reasoning tasks that need to be performed.

Critical to the performance of ODMSs is the fact that the units of data that they store (i.e., triples) are very small so, to retrieve useful information, typical queries tend to be quite large. Efficiently answering such queries requires exhaustive data indexing; however, building and maintaining these indices can itself compromise scalability, particularly during update-intensive tasks such as materialisation-based reasoning. Moreover, although query evaluation is subpolynomial in data size, it is NP-hard in query size, so techniques that are effective on small queries may fail on large and complex queries. Finally, scaling ODMSs to deal with Big Data will inevitably require distributed data storage and query processing, but existing data partitioning schemes are unlikely to fully exploit the potential for parallelisation and minimise distributed processing on large queries.
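
To make the indexing trade-off above concrete, here is a small Python sketch of a triple store that keeps three permutation indexes and evaluates a multi-pattern query by nested index lookups. The class TripleStore, the function evaluate, the index_writes counter, and the ex: data are all hypothetical names chosen for this example; real systems use far more compact and sophisticated structures.

```python
from collections import defaultdict


class TripleStore:
    """Toy store with three permutation indexes (SPO, POS, OSP)."""

    def __init__(self):
        self.spo = defaultdict(lambda: defaultdict(set))
        self.pos = defaultdict(lambda: defaultdict(set))
        self.osp = defaultdict(lambda: defaultdict(set))
        self.index_writes = 0  # index updates caused by inserts

    def add(self, s, p, o):
        if o in self.spo[s][p]:
            return False  # duplicate triple, nothing to index
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.osp[o][s].add(p)
        self.index_writes += 3  # every new triple touches all three indexes
        return True

    def match(self, s=None, p=None, o=None):
        """Yield triples matching a pattern; None acts as a wildcard."""
        if s is not None:
            for p2, objs in self.spo.get(s, {}).items():
                if p is None or p2 == p:
                    for o2 in objs:
                        if o is None or o2 == o:
                            yield (s, p2, o2)
        elif p is not None:
            for o2, subjs in self.pos.get(p, {}).items():
                if o is None or o2 == o:
                    for s2 in subjs:
                        yield (s2, p, o2)
        elif o is not None:
            for s2, preds in self.osp.get(o, {}).items():
                for p2 in preds:
                    yield (s2, p2, o)
        else:
            for s2, by_p in self.spo.items():
                for p2, objs in by_p.items():
                    for o2 in objs:
                        yield (s2, p2, o2)


def evaluate(store, patterns, binding=None):
    """Evaluate a conjunction of triple patterns; '?name' terms are variables."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    # Bound variables become constants in the lookup; unbound ones are wildcards.
    lookup = [binding.get(t) if t.startswith("?") else t for t in first]
    for triple in store.match(*lookup):
        new_binding = dict(binding)
        consistent = True
        for term, value in zip(first, triple):
            if term.startswith("?") and new_binding.setdefault(term, value) != value:
                consistent = False
                break
        if consistent:
            yield from evaluate(store, rest, new_binding)


store = TripleStore()
for t in [
    ("ex:alice", "ex:worksFor", "ex:oxford"),
    ("ex:oxford", "rdf:type", "ex:University"),
    ("ex:oxford", "ex:locatedIn", "ex:uk"),
    ("ex:bob", "ex:worksFor", "ex:acme"),
    ("ex:acme", "rdf:type", "ex:Company"),
]:
    store.add(*t)

query = [
    ("?p", "ex:worksFor", "?o"),
    ("?o", "rdf:type", "ex:University"),
    ("?o", "ex:locatedIn", "?c"),
]
answers = list(evaluate(store, query))
```

Even this simple question needs three triple patterns joined on a shared variable, and each inserted triple costs three index updates. That is the kind of overhead that grows quickly in update-intensive workloads such as materialisation.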

Due to these issues, we believe that the robust scalability required by modern ODMS applications can only be achieved through the principled application of techniques that provide provable performance and/or tractability guarantees. The use of such techniques will not only allow for better and more consistent performance, but will also help ODMS users to better understand and thus avoid performance bottlenecks. We plan to develop the relevant techniques by synthesising and extending the results from three distinct research fields: databases, knowledge representation, and mathematical network theory. Combining these techniques with insightful engineering and extensive optimisation will, we believe, allow us to implement a new ODMS with scalability surpassing that of existing systems by several orders of magnitude. Finally, we will exploit our contacts with industry (see enclosed Letters of Support) to evaluate and tune our ODMS in real-world settings. We will thus lay both the theoretical and the practical foundations for a massively scalable intelligent information infrastructure capable of powering modern data-intensive applications.

Planned Impact

* Academic Impact

We believe that the techniques developed in this project will exert a major influence on the theory and practice of data management and reasoning in several academic communities. As explained in the Case for Support and in line with EPSRC's 'Working Together' priority, addressing the technical challenges inherent in intelligent management of large volumes of data requires collaboration with researchers within and outside ICT. Within ICT, we expect a strong cross-fertilisation of ideas between the knowledge representation and reasoning community on the one side, and the database community on the other side. Outside ICT, solving the problems related to data partitioning will require collaboration with researchers in mathematical network theory. Through these collaborations, this project has the potential to shape the research agenda in knowledge representation and reasoning, databases, and mathematics, contributing new ideas and uncovering challenges for future work. This will contribute to expanding the UK's research base, and to a consolidation of the UK's established world leadership in the mentioned research areas.

* Commercial Impact

Various companies have already recognised ODMSs as a great commercial opportunity. For example, numerous start-ups and small companies in the UK, the EU, and the USA (such as Garlik, ExperienceOn, ontoprise, OpenLink, Clark&Parsia, OntoText, Metaweb, and fluidOps) are currently developing ODMS variants. Well-known providers of data management infrastructure have also recognised the need to support RDF and OWL; for example, Oracle has recently enhanced its well-known database management system with modules that use ontologies to support 'semantic data management'. Although companies such as Oracle see a big market in the application of ontology-based technologies, their existing systems suffer from numerous limitations. Thus, addressing the scalability problems outlined in this proposal would have a significant impact on the business of these companies.

* Dissemination and Engagement

We will undertake a range of activities in order to ensure the widest possible dissemination of our results and engagement with anticipated beneficiaries.

First, we will continue our established pattern of publishing our research in international journals and conferences. Our publications have appeared in top journals such as JACM, AI, JAIR, JWS, VLDB Journal, and Information & Computation, as well as leading conferences such as IJCAI, KR, ISWC, and IJCAR.

Second, we will continue our participation in relevant international coordination and standardisation efforts within groups and organisations such as the World Wide Web Consortium (W3C) and the OWL Experiences and Directions Group. Through these activities we can foster awareness of our work and ensure that it has the maximum possible impact on any future standards. For example, the W3C's OWL 2 ontology language standard is based on our work on description logics.

Third, we will continue our collaboration with the developers of ontology-based systems and applications in both academia and industry, including, for example, BAE Systems, ExperienceOn, and Samsung (see Letters of Support). As well as providing a channel for dissemination, our industry contacts will provide excellent opportunities for commercialising the results of this project.

Fourth, we will make all project outputs available from the project web site, including papers, presentations, tutorials, and software.

 
Description We developed several novel algorithms for the management of RDF data. These include algorithms for computing the materialisation of datalog programs with and without equality in main-memory RDF stores, and algorithms for the incremental maintenance of such materialisations (i.e., for computing how to update the materialisation when only a small fraction of the input changes). We have implemented these techniques in our RDFox data management system and evaluated them against the state of the art. We showed that our techniques considerably outperform related techniques known in the literature. In particular, on a high-end Oracle server we obtained inference rates unparalleled in the literature.
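
For readers unfamiliar with the terminology, the following Python sketch shows what materialisation and incremental maintenance mean, using the textbook semi-naive evaluation strategy on a single mereological rule. It is not the algorithm implemented in RDFox, it does not handle equality, and it covers only additions (maintaining a materialisation under deletions requires more elaborate techniques). All function names and the ex: data are hypothetical.

```python
def is_var(term):
    return term.startswith("?")


def unify(atom, fact, binding):
    """Extend binding so that atom matches fact, or return None."""
    binding = dict(binding)
    for term, value in zip(atom, fact):
        if is_var(term):
            if binding.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return binding


def match_body(body, facts, binding):
    """Yield every binding that satisfies all atoms of body in facts."""
    if not body:
        yield binding
        return
    for fact in facts:
        extended = unify(body[0], fact, binding)
        if extended is not None:
            yield from match_body(body[1:], facts, extended)


def propagate(rules, facts, delta):
    """Semi-naive loop: every rule instance considered uses at least one
    fact from the latest delta, so old derivations are not recomputed."""
    facts = set(facts) | set(delta)
    delta = set(delta)
    while delta:
        derived = set()
        for head, body in rules:
            for i, atom in enumerate(body):
                for fact in delta:
                    binding = unify(atom, fact, {})
                    if binding is None:
                        continue
                    rest = body[:i] + body[i + 1:]
                    for full in match_body(rest, facts, binding):
                        derived.add(tuple(full.get(t, t) for t in head))
        delta = derived - facts
        facts |= delta
    return facts


def materialise(rules, explicit_facts):
    """Compute all consequences of the rules from scratch."""
    return propagate(rules, set(), explicit_facts)


def add_facts(rules, materialisation, new_facts):
    """Incremental maintenance for additions: start from the new facts only."""
    return propagate(rules, materialisation, set(new_facts) - set(materialisation))


# Transitivity of a hypothetical ex:partOf relation.
rules = [
    (("?x", "ex:partOf", "?z"),
     [("?x", "ex:partOf", "?y"), ("?y", "ex:partOf", "?z")]),
]
base = {("ex:a", "ex:partOf", "ex:b"), ("ex:b", "ex:partOf", "ex:c")}
update = {("ex:c", "ex:partOf", "ex:d")}

mat = materialise(rules, base)
mat_incremental = add_facts(rules, mat, update)
mat_from_scratch = materialise(rules, base | update)
```

In this sketch the incrementally updated result coincides with the one recomputed from scratch, while the incremental run only explores rule instances that involve the newly added fact or its consequences.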

In addition, we have developed a novel technique for answering SPARQL queries in distributed RDF systems. The technique is quite different from what is commonly found in federated database systems. We are still evaluating our technique, but the results of our initial performance comparison are very encouraging.
Exploitation Route Our techniques can be used by all RDF management systems that employ materialisation as a reasoning technique. Furthermore, we are working with Oracle on exploring ways of incorporating these techniques in their systems.
Sectors Digital/Communication/Information Technologies (including Software)

 
Description We have submitted a patent application describing the design of some of the key components of our system. We have also published a number of papers describing various forms of new reasoning techniques. I have been collaborating with various companies. As a prominent example of such collaboration, Anthony Potter, a PhD student of mine working on distributed querying techniques closely related to this project, visited Oracle Corporation in 2015 on a four-month internship. During his stay in California, he got Oracle interested to the extent that they implemented Anthony's algorithm in their graph database and are currently evaluating the extent to which they will include the algorithm in their product. Finally, I started two spinout companies: Covatic and Oxford Semantic Technologies. The specific aim of the latter company is to bring RDFox (the system developed in this project) to market. The company has raised considerable investment and is currently employing three people full time.
First Year Of Impact 2015
Sector Digital/Communication/Information Technologies (including Software)
Impact Types Economic

 
Description KE Seed Fund Grant
Amount £3,000 (GBP)
Organisation University of Oxford 
Sector Academic/University
Country United Kingdom
Start 01/2016 
End 01/2016
 
Description Oracle External Research Office grant
Amount $95,000 (USD)
Organisation Oracle Corporation 
Sector Private
Country United States
Start 03/2016 
End 03/2017
 
Description University of Oxford / Impact Acceleration Award
Amount £53,786 (GBP)
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 01/2016 
End 12/2016
 
Description University of Oxford / Impact Acceleration Award
Amount £30,269 (GBP)
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 04/2015 
End 09/2015
 
Description Armasuisse collaboration 
Organisation Federal Office for Defence Procurement Armasuisse
Country Switzerland 
Sector Public 
PI Contribution We collaborated with Armasuisse on applying semantic technologies to the problem of detecting events on Twitter. The collaboration resulted in a paper that will be published at ESWC 2017. The University of Fribourg also collaborated on the project; however, Armasuisse was the main project partner.
Collaborator Contribution Armasuisse provided the use case, the data for the evaluation, and the expertise in analysing Twitter time series data. Their contribution was crucial to getting the ESWC 2017 paper into shape.
Impact ESWC 2017 paper called "ArmaTweet: Detecting Events by Semantic Tweet Analysis". The paper is yet to be published, so the bibliographic details are not yet complete.
Start Year 2016
 
Description Oracle 
Organisation Oracle Corporation
Department Oracle Corporation UK Ltd
Country United Kingdom 
Sector Private 
PI Contribution Anthony Potter, a PhD student in the department, is working on distributed query answering algorithms. In 2015 he visited Oracle on a four-month internship. During the internship, Oracle decided to implement Anthony's algorithm in their graph database. They also decided to support further research on semantic technologies through their External Researcher Programme.
Collaborator Contribution Oracle are supporting the research in semantic technologies with an unrestricted grant of $95k/year.
Impact Oracle implemented the distributed query answering algorithm in their system and is planning to use it in practice.
Start Year 2014
 
Title Parallel materialisation of a set of logical rules on a logical database 
Description This invention concerns the materialisation of a set of logical rules on a logical database, such as a Resource Description Framework (RDF) database. More particularly, but not exclusively, the invention concerns computer-implemented methods of providing the materialisation of a set of logical rules on a logical database that are particularly amenable to parallel execution. The invention also concerns methods of storing data in computer memory when executing such methods. 
IP Reference GB1319252.1 
Protection Patent application published
Year Protection Granted 2014
Licensed No
Impact The technology described in this patent provides the foundation for RDFox -- a software system (listed as output of the MaSI3 grant) for scalable management of RDF data. The University and the PI recently started two spinout companies -- Covatic and Oxford Semantic Technologies -- whose goal is to further develop RDFox and use it in a commercial setting. Both companies are listed as outputs of the MaSI3 fellowship.
 
Title RDFox 
Description Triple store / graph DB 
Type Of Technology Software 
Year Produced 2016 
Impact Basis for Covatic and OST spin-outs 
URL https://www.cs.ox.ac.uk/isg/tools/RDFox/
 
Company Name Covatic Ltd 
Description Covatic aims to utilise semantic technology and linked data developed at the University of Oxford to build the world's first true personalisation engine that will enable broadcasters to deliver context aware, dynamic programming uniquely to each audience member, representing an unparalleled consumer experience. This company is exploiting the IP created in the patent GB1319252.1 that is also listed as an outcome of the MaSI3 fellowship. 
Year Established 2017 
Impact The company is just starting in February 2017, so there are no major impacts yet. However, the company has a partnership with ITN that will guide the development of the products.
Website http://www.covatic.com
 
Company Name Oxford Semantic Technologies Ltd 
Description The company aims to convert RDFox -- a major output of the MaSI3 fellowship -- into a commercial system that can power various enterprise applications in areas as diverse as information integration, compliance reporting, or metadata management. This company is exploiting the IP created in the patent GB1319252.1 that is also listed as an outcome of the MaSI3 fellowship. 
Year Established 2017 
Impact The company has just started so it does not have major impacts yet.
Website http://oxfordsemantic.tech