Coalgebraic Foundations of Semi-Structured Data

Lead Research Organisation: University of Strathclyde
Department Name: Computer and Information Sciences


Databases are irreplaceable in our modern information society. Classically, information has been stored in rigidly structured databases employing the relational data model and the query language SQL. This has led to highly optimised relational database management systems such as Oracle or Mircrosoft SQL that have large-scale industrial deployment. Underpinning much of the success of these systems has been their close connection with their mathematical foundations within relational algebra and mathematical logic. However, the internet has dramatically changed our understanding of data and databases: i) data on the Web is inherently hierarchical and arranged in a network/graph; and ii) the decentralised nature of the Web means that data comes from various heterogenous sources, is often incomplete or unreliable and has no uniform structure. Still, Web data retains some structure and, consequently, semi-structured data seeks to understand what structure persists and how it can be utilised. Examples of widespread, industrially-used data models are XML, JSON and RDF.

Mathematically, semi-structured data is usually represented as unranked labelled trees where nodes are accessed via the parent, child and sibling relations or labelled graphs where nodes are accessed via the edge relation. Data elements are stored at the nodes. Due to the special features of Web data mentioned above, query languages for semi-structured data face significantly greater challenges compared to those for data stored in a relational database: i) queries must navigate the path structure within a tree or graph and query the data elements found along such paths to explore important properties of the data; and ii) one must add a common vocabulary to the data - often via an ontology - that provides a logical layer for integrating semantically related data from heterogeneous sources. But as semi-structured data models have become increasingly sophisticated and expressive - e.g. in order to model the uncertainty attached to the data - the development of matching query and ontology languages have struggled to keep pace. Indeed, while there is a large body of work on specific semi-structured data models and their corresponding query languages there is currently no comprehensive theory that both accounts for existing semi-structured data formats and their query languages, and is able to guide their extension to the next generation of semi-structured data formats. For example, while the widely used query language XPath has been recently extended from XML to graph data, there is currently no agreed mechanism to further extend XPath to graph data with uncertainty.

Our central insight is that coalgebra provides the right level of abstraction to underpin a comprehensive theory of query and ontology languages for semi-structured data. This is because i) coalgebra generalises - via coalgebraic modal logic - the usual modal logics used for traversing trees and graphs; and ii) coalgebra generalises - via coalgebraic logic programming - standard rule-based ontology languages which are based upon logic programming.

Planned Impact

Beneficiaries: The growth of the amount of digitally stored data is so rapid that it is sometimes referred to as the data explosion. In 2012 the amount of data generated every day was a staggering 2.5 exabytes or, in other words, 2.5 billion gigabytes (GB). The potential economic value of the information stored within this mountain of data is difficult to determine - but there are estimates that predict the size of the big data and analytics market will reach $125 billion worldwide in 2015. A key tool for data access is provided by logic and this proposal will create a new generation of logics and a general framework for new query languages, in particular, for such languages that deal with probabilities and weights. This research proposal will lay the basis of a new generation of so-called ontology-based information systems. The latter already found their way into industrial applications: For example, elements of the OWL ontology language are integrated in Oracle Database 11g and Datalog-like dependencies are used in IBMs data exchange tool Clio and the importance of semantic technologies is widely recognised by industry. We expect our tools to be of particular value to smaller companies whose central purpose is the analysis of data, industrial-scale database companies (Oracle, IBM) and large-scale information providers (Google, Yahoo). Furthermore we will increase the skills base of the next generation of data researchers and disseminate our results to non-academic audiences. Concrete pathways to impact are (more details are in the attached Pathways to Impact document):

1. Algorithms & languages: The project will make important first steps on a research program that will develop new data models and tools for accessing and storing data. This will have long- term impact beyond the timeline of the project: future implementations of our data models, query languages & algorithms will form part of the next generation of information systems. To put us on a trajectory towards this, we will produce proof-of-concept implementations together with our PhD student who is already implementing a simple query language for tree-structured data.

2. Knowledge Exchange with local SMEs: The algorithms for ontology-based data access that we will develop in this proposal have the clear potential to be useful for recommender systems. In particular the extension of existing OBDA frameworks to preference-aware reasoning will be of crucial importance. In order to produce factual evidence for my claim, I will involve a Glasgow based company that has strong expertise in recommender systems.

3. Training Delivery: There is a clear need for a highly skilled, specialised work force that has been trained using the latest state-of-the-art database technologies. I will generate impact by bringing up the next generation of researchers by offering postgraduate courses, the Practical Types Network (set up by the programming languages and verification groups in Strathclyde, Dundee and St. Andrews and funded by SICSA), the Midlands Graduate School and at international summer schools such as ESSLLI or NaSSLLI.

4. Software development: In recent years both the UK and the Scottish governments made vast collections of open data accessible to the general public. The idea is that the data will help innovation by enabling software developers to use the data for the creation of new tools. Examples include tools providing travel information, information about local services. Our pathway to impact will be to explore the potential for ontology-based data access in these areas through student projects with the long-term goal to create software that makes use of the more expressive query languages obtained in our research.

In addition to that we will offer to give public lectures on our research and we will seek to widen our network of non-academic contacts through conferences with a high-level of industrial participation such as IJCAI, AAAI or SIGMOD.
Description We managed to obtain key results in both the mathematical theory of coalgebra and concerning database query languages. Concerning the mathematical foundations we obtained (i) automata-based algorithms for dynamic logics - the latter are closely related to graph query languages and (ii) a new learning algorithm that potentially allow to learn a large variety of different types of data. Concerning the actual query languages we made significant progress on the issue of enriching ontological queries with non-monotonic negation.
Exploitation Route The above mentioned learning algorithm is a proof-of-concept result that leaves ample of space for applications in the near future.
Sectors Digital/Communication/Information Technologies (including Software)

Description SICSA Research Theme Event Funding
Amount £1,000 (GBP)
Organisation SICSA Scottish Informatics and Computer Science Alliance 
Sector Academic/University
Country United Kingdom
Start 03/2017 
End 04/2017
Description datafest 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Industry/Business
Results and Impact Participation in a 1-day workshop on Machine Learning as part of the " DataFest - Delivering Value from Data Analytics in an Industrial Environment" at the University of Strathclyde. Attended by more than 40 companies
Year(s) Of Engagement Activity 2018