RealPDBs: Realistic Data Models and Query Compilation for Large-Scale Probabilistic Databases

Lead Research Organisation: University of Oxford

Department Name: Computer Science

Abstract

In recent years, there has been strong interest in academia and industry in building large-scale probabilistic knowledge bases from data in an automated way, which has resulted in a number of systems, such as DeepDive, NELL, Yago, Freebase, Microsoft's Probase, and Google's Knowledge Vault. These systems continuously crawl the Web and extract structured information, and thus populate their databases with millions of entities and billions of tuples. To what extent can these search and extraction systems help with real-world use cases? This turns out to be an open-ended question. For example, DeepDive is used to build knowledge bases for domains such as paleontology, geology, medical genetics, and human movement. From a broader perspective, the quest for building large-scale knowledge bases serves as a new dawn for artificial intelligence research. Fields such as information extraction, natural language processing (e.g., question answering), relational and deep learning, knowledge representation and reasoning, and databases are taking the initiative towards a common goal. Querying large-scale probabilistic knowledge bases is commonly regarded as being at the heart of these efforts.

Beyond all these success stories, however, probabilistic knowledge bases still lack the fundamental machinery to convey some of the valuable knowledge hidden in them to the end user, which seriously limits their potential applications in practice. These problems are rooted in the semantics of (tuple-independent) probabilistic databases, which are used for encoding most probabilistic knowledge bases. For reasons of computational efficiency, probabilistic databases are typically based on strong, unrealistic completeness assumptions, such as the closed-world assumption, the tuple-independence assumption, and the lack of commonsense knowledge. These strong, unrealistic assumptions not only lead to unwanted consequences but also put probabilistic databases on a weak footing in terms of knowledge base learning, completion, and querying. More specifically, each of the above systems encodes only a portion of the real world, and this description is necessarily incomplete; these systems continuously crawl the Web, encounter new sources and consequently new facts, and add such facts to their databases. However, when it comes to querying, most of these systems employ the closed-world assumption, i.e., any fact that is not present in the database is assigned the probability 0, and is thus assumed to be impossible. As a closely related problem, it is common practice to view every extracted fact as an independent Bernoulli variable, i.e., any two facts are probabilistically independent. For example, the fact that a person starred in a movie is treated as independent of the fact that this person is an actor, which is in conflict with the fundamental nature of the knowledge domain.
Furthermore, current probabilistic databases lack (in particular ontological) commonsense knowledge, which can often be exploited in reasoning to deduce implicit consequences from data, and which is often essential for querying large-scale probabilistic databases in an uncontrolled environment such as the Web.
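
The two assumptions described above can be made concrete with a minimal sketch. It is an illustration only, not code from the project: the toy facts, their probabilities, the function names, and the open-world upper bound `lam` are all assumptions introduced here. The sketch computes the probability of a conjunction of facts under tuple independence, once as a product of marginals and once by enumerating possible worlds, and contrasts the closed-world probability 0 of an absent fact with an interval reading.

```python
from itertools import product

# Hypothetical toy database: each extracted fact maps to its marginal probability.
FACTS = {
    ("starredIn", "alice", "movie1"): 0.8,
    ("actor", "alice"): 0.6,
}


def closed_world_prob(fact, db):
    """Closed-world assumption: a fact absent from the database has probability 0."""
    return db.get(fact, 0.0)


def open_world_interval(fact, db, lam):
    """Illustrative open-world reading (assumption): an absent fact may have
    any probability in [0, lam] instead of exactly 0."""
    if fact in db:
        return (db[fact], db[fact])
    return (0.0, lam)


def conjunction_prob(query_facts, db):
    """Probability that all query facts hold, as a product of marginals
    (valid only under tuple independence and the closed-world assumption)."""
    p = 1.0
    for fact in query_facts:
        p *= closed_world_prob(fact, db)
    return p


def conjunction_prob_by_worlds(query_facts, db):
    """Same quantity, computed by summing the weights of all possible worlds
    (subsets of the database) that contain every query fact."""
    keys = list(db)
    total = 0.0
    for bits in product([False, True], repeat=len(keys)):
        weight = 1.0
        for key, present in zip(keys, bits):
            weight *= db[key] if present else 1.0 - db[key]
        world = {key for key, present in zip(keys, bits) if present}
        if all(fact in world for fact in query_facts):
            total += weight
    return total


if __name__ == "__main__":
    both = [("starredIn", "alice", "movie1"), ("actor", "alice")]
    missing = ("actor", "bob")
    print(conjunction_prob(both, FACTS))
    print(conjunction_prob_by_worlds(both, FACTS))
    print(closed_world_prob(missing, FACTS))
    print(open_world_interval(missing, FACTS, lam=0.3))
```

In this toy setting, the joint probability of the two facts about the same person is simply the product of their marginals, even though starring in a movie should make being an actor nearly certain; and a fact that was never extracted is ruled out entirely under the closed-world reading, while the interval reading merely bounds it.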

The main goal of this proposal is to enhance large-scale probabilistic databases (and so to unlock their full data modelling potential) with more realistic data models, while preserving their computational properties. We plan to develop different semantics for the resulting probabilistic databases and to analyse their computational properties and sources of intractability. We also plan to design practical, scalable query answering algorithms for them, especially algorithms based on knowledge compilation techniques, extending existing knowledge compilation approaches and elaborating new ones based on tensor factorisation and neural-symbolic knowledge compilation. We will also produce a prototype implementation and experimentally evaluate the proposed algorithms.

Planned Impact

We are proposing to lay the foundations for a new generation of probabilistic database systems that will revolutionise how we deal with probabilistic data, and unlock their full data modelling potential. As a special kind of Big Data, probabilistic data are being produced by an increasing number of applications, devices, and users, and one of their main challenges is how to deal with their incompleteness, most notably in the context of the World Wide Web. The commercial value of probabilistic data management is also reflected by the fact that the company Lattice.io, which grew out of the probabilistic database system DeepDive, has just been acquired by Apple for $175 million to $200 million. Big Data in general are critically important for a variety of different areas in science, industry, governments, and healthcare; their economic potential is enormous, estimated to exceed £50B annually in the UK (over 2.5% of the entire UK GDP). The beneficiaries could, in the long term, include anyone who uses or depends on probabilistic data, such as those collected on the Web. In the Western world at least, this effectively includes every business/organisation and every individual.

In the shorter term, the techniques for uncertainty and incompleteness handling to be developed in this project will exert a major influence on the theory and practice of probabilistic data management and of ontological data access in a more traditional database and information system context, both within and outside the academic community. Thus, our work will be of benefit to all those working in the broad area of information systems, including researchers, both in academia and industry, in the fields of databases and ontologies. Specifically, in the context of Web data, our work will make it possible to deal with uncertainty and incompleteness in knowledge bases that result from the ontology-based extraction and integration of probabilistic data from the Web. Thus, other short-term beneficiaries will include researchers in both academia and industry working on the extraction and integration of probabilistic data from the Web, as well as on query answering from the resulting knowledge bases.

Unsurprisingly, strong interest in the project is coming from companies working in these areas, such as our project partner Wrapidity. One business case that Wrapidity is interested in is understanding and extracting business information from Dark Data, e.g., to analyse investment trends in the region. This requires scalable, expressive, and flexible reasoning (currently unavailable), especially on uncertain and incomplete data. Since incompleteness and ontological commonsense handling in probabilistic databases includes as special cases uncertainty and incompleteness handling in single and distributed ontologies, other short-term beneficiaries are researchers and developers dealing with large uncertain and incomplete ontologies.

In the longer term, beneficiaries will be data analysts working with probabilistic data who need to handle incomplete data, as well as researchers and developers who need to exploit probabilistic data in applications. Hence, our work will also be of benefit to researchers in science, industry, governments, and healthcare in such diverse areas as, e.g., biology, medicine, geography, astronomy, agriculture, and aerospace. For example, together with a natural language interface, the realistic probabilistic data models and their scalable query answering techniques to be developed in this project will in the long term pave the way for powerful question answering systems in healthcare or even for medical diagnosis.

The project will also be of wider benefit to the UK's research community by answering important open questions, contributing to the UK's research base, and helping to further cement the UK's world leadership in this research area. In summary, the project will contribute to enhancing the UK's scientific relevance and excellence.

Funded Value:

£781,286

Funded Period:

Dec 17 - Feb 22

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/R013667/1

Principal Investigator:

Thomas Lukasiewicz

Research Subject:

Info. & commun. Technol. (100%)

Research Topic:

Artificial Intelligence (10%)

Information & Knowledge Mgmt (90%)

Organisations

People	ORCID iD
Thomas Lukasiewicz (Principal Investigator)
Dan Olteanu (Co-Investigator)
Georg Gottlob (Co-Investigator)

Publications

Abboud R (2022) Approximate weighted model integration on DNF structures in Artificial Intelligence

Abboud R (2020) Learning to Reason: Leveraging Neural Networks for Approximate DNF Counting in Proceedings of the AAAI Conference on Artificial Intelligence

Abboud R (2019) Learning to Reason: Leveraging Neural Networks for Approximate DNF Counting

Abboud R (2021) The Surprising Power of Graph Neural Networks with Random Node Initialization

Abboud R. (2020) Learning to reason: Leveraging neural networks for approximate dnf counting in AAAI 2020 - 34th AAAI Conference on Artificial Intelligence

Amador-Domínguez E (2021) An ontology-based deep learning approach for triple classification with out-of-knowledge-base entities in Information Sciences

Amarilli A (2022) The Dichotomy of Evaluating Homomorphism-Closed Queries on Probabilistic Graphs in Logical Methods in Computer Science

Amarilli A. (2020) A dichotomy for homomorphism-closed queries on probabilistic graphs in Leibniz International Proceedings in Informatics, LIPIcs

Anelli V (2020) Combining RDF and SPARQL with CP-theories to reason about preferences in a Linked Data setting in Semantic Web

Antoine Amarilli (2020) A Dichotomy for Homomorphism-Closed Queries on Probabilistic Graphs

Borgwardt S (2018) Recent Advances in Querying Probabilistic Knowledge Bases

Borgwardt S (2019) Ontology-Mediated Query Answering over Log-Linear Probabilistic Data in Proceedings of the AAAI Conference on Artificial Intelligence

Borgwardt S. (2019) Ontology-mediated query answering over log-linear probabilistic data in 33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Innovative Applications of Artificial Intelligence Conference, IAAI 2019 and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019

Borgwardt S. (2019) Ontology-mediated query answering over log-linear probabilistic data (Abstract) in CEUR Workshop Proceedings

Camburu O (2019) Can I Trust the Explainer? Verifying Post-hoc Explanatory Methods

Camburu O (2020) Make Up Your Mind! Adversarial Generation of Inconsistent Natural Language Explanations

Camburu O.-M. (2020) Make up your mind! adversarial generation of inconsistent natural language explanations in Proceedings of the Annual Meeting of the Association for Computational Linguistics

Ceylan I (2021) Open-world probabilistic databases: Semantics, algorithms, complexity in Artificial Intelligence

Ceylan I (2020) Explanations for Negative Query Answers under Existential Rules

Ceylan I (2018) Reasoning Web. Learning, Uncertainty, Streaming, and Scalability - 14th International Summer School 2018, Esch-sur-Alzette, Luxembourg, September 22-26, 2018, Tutorial Lectures

Ceylan I (2019) Explanations for Query Answers under Existential Rules

Ceylan I.I. (2022) Query Answer Explanations under Existential Rules in CEUR Workshop Proceedings

Eleonora Giunchiglia (2020) Coherent Hierarchical Multi-Label Classification Networks

Elvira Amador-Domínguez (2021) An Ontology-Based Deep Learning Approach for Triple Classification with Out-of-Knowledge-Base Entities in Information Sciences

Fadahunsi OS (2022) Angiotensin converting enzyme inhibitors from medicinal plants: a molecular docking and dynamic simulation approach. in In silico pharmacology

Finzi A (2020) Partially observable game-theoretic agent programming in Golog in International Journal of Approximate Reasoning

Georg Gottlob (2021) Stable Model Semantics for Guarded Existential Rules and Description Logics: Decidability and Complexity in Journal of the ACM

Giunchiglia E (2021) Multi-Label Classification Neural Networks with Hard Logical Constraints in Journal of Artificial Intelligence Research

Giunchiglia E (2023) ROAD-R: the autonomous driving dataset with logical requirements in Machine Learning

Gottlob G (2021) Stable Model Semantics for Guarded Existential Rules and Description Logics: Decidability and Complexity in Journal of the ACM

Hohenecker P (2018) Ontology Reasoning with Deep Neural Networks

Hohenecker P (2020) Ontology Reasoning with Deep Neural Networks in Journal of Artificial Intelligence Research

Hohenecker P. (2020) Ontology reasoning with deep neural networks in IJCAI International Joint Conference on Artificial Intelligence

Ismail Ilkan Ceylan (2021) Preferred Explanations for Ontology-Mediated Queries under Existential Rules

Ismail Ilkan Ceylan (2020) Explanations for Ontology-Mediated Query Answering in Description Logics

Jang M (2022) NoiER: An Approach for Training More Reliable Fine-Tuned Downstream Task Models in IEEE/ACM Transactions on Audio, Speech, and Language Processing

Kayser M (2021) e-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks

Kayser M (2022) Medical Image Computing and Computer Assisted Intervention - MICCAI 2022 - 25th International Conference, Singapore, September 18-22, 2022, Proceedings, Part V

Kocijan V (2019) WikiCREM: A Large Unsupervised Corpus for Coreference Resolution

Kocijan V (2019) A Surprisingly Robust Trick for Winograd Schema Challenge

Kocijan V (2019) A Surprisingly Robust Trick for the Winograd Schema Challenge

Kocijan V. (2019) WikiCrem: A large unsupervised corpus for coreference resolution in EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference

Li B (2022) Clustering Generative Adversarial Networks for Story Visualization

Li B (2022) Learning to Model Multimodal Semantic Alignment for Story Visualization

Li Y (2023) Hi-BEHRT: Hierarchical Transformer-Based Model for Accurate Prediction of Clinical Events Using Multimodal Longitudinal Electronic Health Records. in IEEE journal of biomedical and health informatics

Lin H (2022) Toward Knowledge as a Service (KaaS): Predicting Popularity of Knowledge Services Leveraging Graph Neural Networks in IEEE Transactions on Services Computing

Lukasiewicz T (2018) Complexity of Approximate Query Answering under Inconsistency in Datalog+/-

Lukasiewicz T (2019) Complexity of Inconsistency-Tolerant Query Answering in Datalog+/- under Cardinality-Based Repairs in Proceedings of the AAAI Conference on Artificial Intelligence

Collaboration


Description	Collaboration with Antoine Amarilli
Organisation	Telecom Paris
Country	France
Sector	Academic/University
PI Contribution	Joint research towards the ICDT 2020 paper "A Dichotomy for Homomorphism-Closed Queries on Probabilistic Graphs".
Collaborator Contribution	Joint research towards the ICDT 2020 paper "A Dichotomy for Homomorphism-Closed Queries on Probabilistic Graphs".
Impact	ICDT 2020 paper "A Dichotomy for Homomorphism-Closed Queries on Probabilistic Graphs".
Start Year	2019


Description	Collaboration with Phokion Kolaitis
Organisation	IBM Research - Almaden
Country	United States
Sector	Private
PI Contribution	Collaboration has just started.
Collaborator Contribution	Collaboration has just started.
Impact	Collaboration has just started.
Start Year	2021
