iTract: Islands of Tractability in Ontology-Based Data Access
Lead Research Organisation:
University of Liverpool
Department Name: Computer Science
Abstract
15 years ago most data was structured, complete, and neatly organised in databases. This is no longer the case. Unstructured, incomplete, and heterogeneous data sets are proliferating at an enormous rate. This is most evident in the context of the World Wide Web, but also applies to scientific data, data in business and industry, data in healthcare and in many other areas. To make use of such data, traditional information systems based on standard database technologies are no longer sufficient.
Ontology-based data access and management is a novel approach to address this challenge by introducing a semantic layer (ontology) that provides the user with a high-level unified view of the data as well as a vocabulary to access and query the data. Ontologies model application domains by providing machine readable definitions of terms and relationships between them. They are already used in numerous applications, for example, by the NHS: to enable communication between health professionals within the United Kingdom and worldwide, it is crucial that they use the same terminology; such a terminology is provided by the ontology SNOMED CT.
Using ontologies to access data and thereby directly combining data and knowledge is a novel idea of the 21st century. First applications have demonstrated that ontology-based data access and management is indeed feasible and has the potential to revolutionise modern information systems. However, the scalability of query answering with expressive ontology languages remains a major challenge, and it is the aim of this project to develop a new "island of tractability" approach to tackle it. Our approach links ontology-based data access with two well-established and successful areas of Computer Science: constraint satisfaction and Boolean circuit complexity. We aim to transfer proof methods, techniques, and methodologies from these two areas to ontology-based data access. This includes a non-uniform complexity analysis, in which we aim to classify the complexity of answering ontology-mediated queries, each of which consists of an ontology and a standard database query. Based on this complexity analysis, we will develop uniformly efficient query answering algorithms for the identified islands of tractable ontology-mediated queries, and implement them in the ontology-based data access systems Ontop and Combo. We will apply our novel technology to case studies from the oil and gas industry and from healthcare.
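To make the notion of an ontology-mediated query concrete, the following minimal sketch pairs a toy ontology (a set of subclass statements) with an atomic database query and answers it by rewriting the query into a union of queries over the raw data. All class names, individuals, and function names are hypothetical illustrations; the sketch covers only plain subclass axioms and is not the algorithm used in Ontop or Combo.

```python
from collections import deque

# Hypothetical toy ontology: each pair (sub, sup) reads "every sub is a sup".
ONTOLOGY = {
    ("Cardiologist", "Doctor"),
    ("Surgeon", "Doctor"),
    ("Doctor", "HealthProfessional"),
    ("Nurse", "HealthProfessional"),
}

# Hypothetical incomplete data: (class, individual) facts as stored in a database.
DATA = {
    ("Cardiologist", "ann"),
    ("Nurse", "bob"),
    ("Doctor", "eve"),
    ("Patient", "joe"),
}


def rewrite(query_class, ontology):
    """Return every class whose members the ontology entails to be in query_class.

    The result can be read as a union of atomic queries, i.e. a simple
    first-order rewriting of the ontology-mediated query.
    """
    seen = {query_class}
    queue = deque([query_class])
    while queue:
        current = queue.popleft()
        for sub, sup in ontology:
            if sup == current and sub not in seen:
                seen.add(sub)
                queue.append(sub)
    return seen


def certain_answers(query_class, ontology, data):
    """Evaluate the rewritten query directly over the data, ignoring the ontology."""
    classes = rewrite(query_class, ontology)
    return sorted({individual for cls, individual in data if cls in classes})


if __name__ == "__main__":
    print(certain_answers("HealthProfessional", ONTOLOGY, DATA))
    print(certain_answers("Doctor", ONTOLOGY, DATA))
```

In this sketch the ontology is consulted only during rewriting, so the final evaluation is an ordinary database lookup. For more expressive ontology languages such a rewriting may be very large or may not exist, which is the scalability issue the project addresses.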
Planned Impact
The "Big Data Revolution" has been identified by the UK Government as one of eight technologies that will propel the UK to future growth. Within this challenge, users (e.g., data analysts) will require a semantic framework that enables them to access and query large, unstructured, incomplete and heterogeneous data sets. It is the main goal of ontology-based data access (OBDA) and the iTract project to provide such a framework. iTract (together with the OBDA research and user community) aims to transform the way in which Data, Information, and Knowledge interact. We propose to use knowledge (ontologies) directly to access data. Because of the inherent complexity of querying knowledge (as opposed to data), this seemed impossible 10 years ago, but the progress made in recent years and the transformation to OBDA we propose in the iTract project have a realistic chance of overcoming this problem.
Consequently, in the long term, the beneficiaries of our research will include every user of information systems that have to deal with unstructured, incomplete and heterogeneous data sets, in particular, the users of the World Wide Web; in the shorter term, the beneficiaries will include researchers in both academia and industry who aim to develop the new generation of information systems, as well as those companies and organisations who have already decided to use OBDA for querying their data sets because the traditional ways are inefficient or impossible.
Our direct actions to engage potential users in industry, business, and healthcare include the following:
- Members of the iTract team will take part in the annual demonstration and evaluation sessions organised by our partner in Oslo with the Norwegian oil and gas industry.
- The software produced by iTract will be presented at workshops and in tutorials attended by potential users from industry, business, and healthcare. It will also be made publicly available on the Web.
- We will promote our OBDA systems to potential users within industry through the network established by our partner in Oslo.
- We will apply our systems to query medical records using the SNOMED CT ontology in the framework of the original case study developed by IBM Watson in 2007. Our aim is to demonstrate to potential users the efficiency of our technologies in the area of health care.
- Building on the results of the previous step, we will contact the UK Terminology Centre (UKTC) to discuss applications of OBDA for SNOMED CT in the UK.
Publications
Artale A.
(2017)
Ontology-mediated query answering over temporal data: A survey
in Leibniz International Proceedings in Informatics, LIPIcs
Bienvenu M.
(2016)
First order-rewritability and containment of conjunctive queries in Horn description logics
in IJCAI International Joint Conference on Artificial Intelligence
Botoeva E
(2019)
Query Inseparability for ALC Ontologies
Botoeva E
(2019)
Query inseparability for ALC ontologies
in Artificial Intelligence
Botoeva E.
(2016)
Query-based entailment and inseparability for ALC ontologies
in IJCAI International Joint Conference on Artificial Intelligence
Hernich A
(2020)
Dichotomies in Ontology-Mediated Querying with the Guarded Fragment
in ACM Transactions on Computational Logic
Hernich A
(2017)
Dichotomies in Ontology-Mediated Querying with the Guarded Fragment
Hernich A
(2018)
Dichotomies in Ontology-Mediated Querying with the Guarded Fragment
Description | We have discovered and investigated many new and important islands of tractability in ontology-mediated query answering, including the following: (1) We considered ontology-mediated querying with expressive data types such as the integers, the rational numbers, and related spatial and temporal data types. Using recent results on P/NP dichotomies for temporal constraint satisfaction problems, we obtained P/coNP dichotomies for ontology-mediated querying with datatypes. Moreover, in many cases membership in the tractable class is decidable, sometimes even by a straightforward syntactic check. This work was published, for example, at AAAI 2017. (2) We considered ontologies formulated in the guarded fragment of first-order logic and determined very expressive fragments for which there is a P/coNP dichotomy for ontology-mediated query answering. In many practically relevant cases we obtained NExpTime or ExpTime procedures for deciding whether an ontology-mediated query is tractable, thus identifying important classes of queries for which PTime querying is possible. We also proved dichotomies between datalog-rewritability and coNP-hardness. This work received the Best Paper Award at PODS 2017. We investigated the relationship between ontology-mediated query answering using unions of conjunctive queries and ontology-mediated query answering using SPARQL queries. We developed criteria and decision procedures for determining when the former can be reduced to the latter type of queries. This research is practically relevant as many implemented systems are based on SPARQL queries. This work received the Distinguished Paper Award at IJCAI 2018. We investigated the question of whether all tractable ontology-mediated queries can be rewritten into queries based on Horn ontologies, presenting both positive and negative results.
We also gave decision procedures for containment and first-order rewritability of ontology-mediated queries over Horn ontologies. This work was presented at IJCAI 2016 and IJCAI 2018. (3) We gave solutions to two fundamental computational problems in ontology-based data access with the W3C standard ontology language OWL 2 QL: the succinctness problem for first-order rewritings of ontology-mediated queries, and the complexity problem for ontology-mediated query answering. We classified ontology-mediated queries according to the shape of their conjunctive queries (treewidth, the number of leaves) and the existential depth of their ontologies. For each of these classes, we determined the combined complexity of ontology-mediated query answering, and whether all ontology-mediated queries in the class have polynomial-size first-order, positive existential, and nonrecursive datalog rewritings. We obtained the succinctness results using hypergraph programs, a new computational model for Boolean functions, which makes it possible to connect the size of ontology-mediated query rewritings with circuit complexity. This work was published at LICS 2014 and 2015 and in the Journal of the ACM 2018. We extended this analysis to ontology-mediated queries with sets of linear tgds and conjunctive queries of bounded hypertree width. We also investigated the parameterised complexity of answering tree-shaped ontology-mediated queries in OWL 2 QL under various restrictions on their ontologies and conjunctive queries. In particular, we constructed an ontology T such that answering ontology-mediated queries (T,q) with a tree-shaped query q is W[1]-hard if the number of leaves in q is regarded as the parameter. The number of leaves has previously been identified as an important characteristic of conjunctive queries, as bounding it leads to tractable ontology-mediated query answering. This work was presented at PODS 2017.
(4) We have investigated islands of tractability in temporal ontology-based data access with the linear temporal logic LTL, Halpern-Shoham interval temporal logic HS and metric temporal logic MTL. This work was published in JAIR 2018, ACM TOCL 2017 and presented at IJCAI 2016, AAAI 2017, TIME 2017. |
Exploitation Route | We expect that our findings will be used in ontology-based query answering settings in both academia and industry. |
Sectors | Digital/Communication/Information Technologies (including Software), Education, Energy, Healthcare, Culture, Heritage, Museums and Collections |
Description | Modern organisations accumulate vast amounts of data, stored in multiple and complex databases. Extracting data is a time-consuming and onerous process, especially for non-IT specialists. Virtual Knowledge Graphs (VKGs) provide users with a search vocabulary that facilitates information extraction without relying on IT specialists, leading to cost/efficiency savings and opening up data repositories to data analytics. Our research underpins the reasoning algorithms in the VKG system Ontop, which has applications across a wide range of sectors including energy, healthcare, education and innovation. Ontop is available as open source on GitHub (github.com/ontop/ontop) and has become 'one of the leading Virtual Knowledge Graph systems worldwide'. It has been bundled with downloads of Stanford University's Protégé, an ontology development platform with over 366,000 users (Dec 2020). In April 2019, UniBZ spun out the Ontop work into a start-up company, Ontopic s.r.l., which now employs three full-time staff who work alongside UniBZ academics to develop tailored commercial solutions based on the Ontop framework. In addition to the economic benefit to those employed, the development of spin-out companies such as Ontopic serves UniBZ's goals as an institution: 'technology and knowledge transfer is the third pillar of the university... joint projects ensure the practical relevance of research and education' (www.unibz.it/en/home/companies-and-partnerships/knowledge-technology-transfer). Together with the Norwegian SIRIUS Centre for Scalable Data Access in the Oil and Gas Domain, we worked with the multinational energy firm Equinor (formerly Statoil) and with the German industrial manufacturing conglomerate Siemens to develop exemplar VKG tools, in which Ontop was a core component. In Brazil, Ontop forms the basis for a VKG system named Recruit, implemented at the AC Camargo Cancer Centre in São Paulo.
VKGs have proven useful in providing access to open data repositories which facilitate the smoother running of regional infrastructure. Ontopic's ongoing projects in this sector include an €80,000 collaboration with the Italian province of South Tyrol to extend their tourism open data portal. In Spain, SIRIS Academic (a consultancy and think-tank based in Barcelona which employs over 30 staff) has drawn on Ontop's VKG technology to provide information solutions for its clients, describing Ontop as 'indispensable' to its work. SIRIS's initial work with Ontop came in the context of EPNet, a €2,400,000 ERC-funded project that integrated three Roman archaeological databases into a user-friendly interface allowing scholars to easily run searches across them. Notably, Ontop underpins SIRIS's UNiCS (unics.cloud), 'an Open Data platform based on semantic technologies that integrates an ever-growing number of repositories and datasets about the higher education, research and innovation sector in Europe'. 'Approximately 10,000 users each year from institutions including universities, local governments, and regional agencies responsible for research and development' use UNiCS (and the customised portals built from it) to better understand their operating context, allowing them to make informed strategic decisions for the future. SIRIS also uses UNiCS as the basis for customised data mining applications and strategic solutions for its clients. The Ontop system is also at the core of the BT Hypercat Data Hub and in the DALI project at IBM Ireland. |
First Year Of Impact | 2017 |
Sector | Digital/Communication/Information Technologies (including Software), Education, Energy |
Impact Types | Cultural, Societal, Economic, Policy & public services |