Coalgebraic Foundations of Semi-Structured Data

Lead Research Organisation: University of Strathclyde

Department Name: Computer and Information Sciences

Abstract

Databases are irreplaceable in our modern information society. Classically, information has been stored in rigidly structured databases employing the relational data model and the query language SQL. This has led to highly optimised relational database management systems such as Oracle or Mircrosoft SQL that have large-scale industrial deployment. Underpinning much of the success of these systems has been their close connection with their mathematical foundations within relational algebra and mathematical logic. However, the internet has dramatically changed our understanding of data and databases: i) data on the Web is inherently hierarchical and arranged in a network/graph; and ii) the decentralised nature of the Web means that data comes from various heterogenous sources, is often incomplete or unreliable and has no uniform structure. Still, Web data retains some structure and, consequently, semi-structured data seeks to understand what structure persists and how it can be utilised. Examples of widespread, industrially-used data models are XML, JSON and RDF.

Mathematically, semi-structured data is usually represented as unranked labelled trees where nodes are accessed via the parent, child and sibling relations or labelled graphs where nodes are accessed via the edge relation. Data elements are stored at the nodes. Due to the special features of Web data mentioned above, query languages for semi-structured data face significantly greater challenges compared to those for data stored in a relational database: i) queries must navigate the path structure within a tree or graph and query the data elements found along such paths to explore important properties of the data; and ii) one must add a common vocabulary to the data - often via an ontology - that provides a logical layer for integrating semantically related data from heterogeneous sources. But as semi-structured data models have become increasingly sophisticated and expressive - e.g. in order to model the uncertainty attached to the data - the development of matching query and ontology languages have struggled to keep pace. Indeed, while there is a large body of work on specific semi-structured data models and their corresponding query languages there is currently no comprehensive theory that both accounts for existing semi-structured data formats and their query languages, and is able to guide their extension to the next generation of semi-structured data formats. For example, while the widely used query language XPath has been recently extended from XML to graph data, there is currently no agreed mechanism to further extend XPath to graph data with uncertainty.

Our central insight is that coalgebra provides the right level of abstraction to underpin a comprehensive theory of query and ontology languages for semi-structured data. This is because i) coalgebra generalises - via coalgebraic modal logic - the usual modal logics used for traversing trees and graphs; and ii) coalgebra generalises - via coalgebraic logic programming - standard rule-based ontology languages which are based upon logic programming.

Planned Impact

Beneficiaries: The growth of the amount of digitally stored data is so rapid that it is sometimes referred to as the data explosion. In 2012 the amount of data generated every day was a staggering 2.5 exabytes or, in other words, 2.5 billion gigabytes (GB). The potential economic value of the information stored within this mountain of data is difficult to determine - but there are estimates that predict the size of the big data and analytics market will reach $125 billion worldwide in 2015. A key tool for data access is provided by logic and this proposal will create a new generation of logics and a general framework for new query languages, in particular, for such languages that deal with probabilities and weights. This research proposal will lay the basis of a new generation of so-called ontology-based information systems. The latter already found their way into industrial applications: For example, elements of the OWL ontology language are integrated in Oracle Database 11g and Datalog-like dependencies are used in IBMs data exchange tool Clio and the importance of semantic technologies is widely recognised by industry. We expect our tools to be of particular value to smaller companies whose central purpose is the analysis of data, industrial-scale database companies (Oracle, IBM) and large-scale information providers (Google, Yahoo). Furthermore we will increase the skills base of the next generation of data researchers and disseminate our results to non-academic audiences. Concrete pathways to impact are (more details are in the attached Pathways to Impact document):

1. Algorithms & languages: The project will make important first steps on a research program that will develop new data models and tools for accessing and storing data. This will have long- term impact beyond the timeline of the project: future implementations of our data models, query languages & algorithms will form part of the next generation of information systems. To put us on a trajectory towards this, we will produce proof-of-concept implementations together with our PhD student who is already implementing a simple query language for tree-structured data.

2. Knowledge Exchange with local SMEs: The algorithms for ontology-based data access that we will develop in this proposal have the clear potential to be useful for recommender systems. In particular the extension of existing OBDA frameworks to preference-aware reasoning will be of crucial importance. In order to produce factual evidence for my claim, I will involve a Glasgow based company that has strong expertise in recommender systems.

3. Training Delivery: There is a clear need for a highly skilled, specialised work force that has been trained using the latest state-of-the-art database technologies. I will generate impact by bringing up the next generation of researchers by offering postgraduate courses, the Practical Types Network (set up by the programming languages and verification groups in Strathclyde, Dundee and St. Andrews and funded by SICSA), the Midlands Graduate School and at international summer schools such as ESSLLI or NaSSLLI.

4. Software development: In recent years both the UK and the Scottish governments made vast collections of open data accessible to the general public. The idea is that the data will help innovation by enabling software developers to use the data for the creation of new tools. Examples include tools providing travel information, information about local services. Our pathway to impact will be to explore the potential for ontology-based data access in these areas through student projects with the long-term goal to create software that makes use of the more expressive query languages obtained in our research.

In addition to that we will offer to give public lectures on our research and we will seek to widen our network of non-academic contacts through conferences with a high-level of industrial participation such as IJCAI, AAAI or SIGMOD.

Funded Value:

£99,423

Funded Period:

Feb 16 - Jan 18

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/N015843/1

Principal Investigator:

Clemens Achim Kupke

Research Subject:

Info. & commun. Technol. (100%)

Research Topic:

Fundamentals of Computing (100%)

Organisations

People	ORCID iD
Clemens Achim Kupke (Principal Investigator)

Publications

Author Name

Title Publication Date Published

10 25 50

Barlocco S (2019) Foundations of Software Science and Computation Structures - 22nd International Conference, FOSSACS 2019, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2019, Prague, Czech Republic, April 6-11, 2019, Proceedings

Barlocco S (2018) Logical Foundations of Computer Science

Bezhanishvili N (2016) Games for Topological Fixpoint Logic in Electronic Proceedings in Theoretical Computer Science

Enqvist S (2019) Completeness for Game Logic

Ghani N (2018) A compositional treatment of iterated open games in Theoretical Computer Science

Gottlob G (2021) Stable Model Semantics for Guarded Existential Rules and Description Logics: Decidability and Complexity in Journal of the ACM

Hansen H (2018) Dynamic Logic. New Trends and Applications

Hansen H.H. (2017) Stream differential equations: Specification formats and solution methods in Logical Methods in Computer Science

Marti J (2019) A Discrete Duality Between Nonmonotonic Consequence Relations and Convex Geometries in Order

Key Findings
Further Funding
Engagement Activities


Description	We managed to obtain key results in both the mathematical theory of coalgebra and concerning database query languages. Concerning the mathematical foundations we obtained (i) automata-based algorithms for dynamic logics - the latter are closely related to graph query languages and (ii) a new learning algorithm that potentially allow to learn a large variety of different types of data. Concerning the actual query languages we made significant progress on the issue of enriching ontological queries with non-monotonic negation.
Exploitation Route	The above mentioned learning algorithm is a proof-of-concept result that leaves ample of space for applications in the near future.
Sectors	Digital/Communication/Information Technologies (including Software)


Description	SICSA Research Theme Event Funding
Amount	£1,000 (GBP)
Organisation	SICSA Scottish Informatics and Computer Science Alliance
Sector	Academic/University
Country	United Kingdom
Start	03/2017
End	04/2017


Description	datafest
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Industry/Business
Results and Impact	Participation in a 1-day workshop on Machine Learning as part of the " DataFest - Delivering Value from Data Analytics in an Industrial Environment" at the University of Strathclyde. Attended by more than 40 companies
Year(s) Of Engagement Activity	2018

Abstract

Planned Impact

Organisations

People

ORCID iD

Publications