RealPDBs: Realistic Data Models and Query Compilation for Large-Scale Probabilistic Databases

Lead Research Organisation: University of Oxford

Department Name: Computer Science

Abstract

In recent years, there has been strong interest in academia and industry in building large-scale probabilistic knowledge bases from data in an automated way, which has resulted in a number of systems, such as DeepDive, NELL, Yago, Freebase, Microsoft's Probase, and Google's Knowledge Vault. These systems continuously crawl the Web and extract structured information, and thus populate their databases with millions of entities and billions of tuples. To what extent can these search and extraction systems help with real-world use cases? This turns out to be an open-ended question. For example, DeepDive is used to build knowledge bases for domains such as paleontology, geology, medical genetics, and human movement. From a broader perspective, the quest for building large-scale knowledge bases serves as a new dawn for artificial intelligence research. Fields such as information extraction, natural language processing (e.g., question answering), relational and deep learning, knowledge representation and reasoning, and databases are taking the initiative towards a common goal. Querying large-scale probabilistic knowledge bases is commonly regarded as being at the heart of these efforts.

Beyond all these success stories, however, probabilistic knowledge bases still lack the fundamental machinery to convey some of the valuable knowledge hidden in them to the end user, which seriously limits their potential applications in practice. These problems are rooted in the semantics of (tuple-independent) probabilistic databases, which are used for encoding most probabilistic knowledge bases. For computational efficiency reasons, probabilistic databases are typically based on strong, unrealistic completeness assumptions, such as the closed-world assumption, the tuple-independence assumption, and the lack of commonsense knowledge. These strong unrealistic assumptions not only lead to unwanted consequences, but also put probabilistic databases on a weak footing in terms of knowledge base learning, completion, and querying. More specifically, each of the above systems encodes only a portion of the real world, and this description is necessarily incomplete; these systems continuously crawl the Web, encounter new sources, and consequently new facts, leading them to add such facts to their database. However, when it comes to querying, most of these systems employ the closed-world assumption, i.e., any fact that is not present in the database is assigned the probability 0, and thus assumed to be impossible. As a closely related problem, it is common practice to view every extracted fact as an independent Bernoulli variable, i.e., any two facts are probabilistically independent. For example, the fact that a person starred in a movie is independent of the fact that this person is an actor, which is in conflict with the fundamental nature of the knowledge domain.
Furthermore, current probabilistic databases lack (in particular ontological) commonsense knowledge, which can often be exploited in reasoning to deduce implicit consequences from data, and which is often essential for querying large-scale probabilistic databases in an uncontrolled environment such as the Web.
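
To make the closed-world and tuple-independence assumptions concrete, the following is a minimal Python sketch of how a tuple-independent probabilistic database would score simple queries. The toy facts, their probabilities, and the helper names (`prob`, `prob_and`, `prob_or`) are hypothetical and chosen only for illustration; they are not taken from any of the systems named above.

```python
from math import prod

# Hypothetical toy database: each extracted fact maps to its marginal probability.
facts = {
    ("StarredIn", "alice", "movie1"): 0.9,
    ("StarredIn", "alice", "movie2"): 0.5,
    ("Actor", "alice"): 0.6,
}

def prob(fact):
    """Closed-world lookup: a fact absent from the database gets probability 0."""
    return facts.get(fact, 0.0)

def prob_and(*fs):
    """Probability that all given distinct facts hold, assuming tuple independence."""
    return prod(prob(f) for f in fs)

def prob_or(*fs):
    """Probability that at least one of the given distinct facts holds."""
    return 1.0 - prod(1.0 - prob(f) for f in fs)

# The joint probability is just the product of the marginals: starring in a
# movie tells the model nothing about being an actor.
joint = prob_and(("StarredIn", "alice", "movie1"), ("Actor", "alice"))

# A fact that was never extracted is treated as impossible.
missing = prob(("Actor", "bob"))
```

In this sketch the joint probability is the product of the two marginals, and the unseen fact about `bob` is scored as impossible, which are exactly the two consequences the text describes as unrealistic.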

The main goal of this proposal is to enhance large-scale probabilistic databases (and so to unlock their full data modelling potential) by more realistic data models, while preserving their computational properties. We are planning to develop different semantics for the resulting probabilistic databases and analyse their computational properties and sources of intractability. We are also planning to design practical scalable query answering algorithms for them, especially algorithms based on knowledge compilation techniques, extending existing knowledge compilation approaches and elaborating new ones, based on tensor factorisation and neural-symbolic knowledge compilation. We will also produce a prototype implementation and experimentally evaluate the proposed algorithms.

Planned Impact

We are proposing to lay the foundations for a new generation of probabilistic database systems that will revolutionise how we deal with probabilistic data, and unlock their full data modelling potential. As a special kind of Big Data, probabilistic data are being produced by an increasing number of applications, devices, and users, and one of their main challenges is how to deal with their incompleteness, most notably in the context of the World Wide Web. The commercial value of probabilistic data management is also reflected by the fact that the company Lattice.io, which grew out of the probabilistic database system DeepDive, has just been acquired by Apple for $175 million to $200 million. Big Data in general are critically important for a variety of different areas in science, industry, governments, and healthcare; their economic potential is enormous, estimated to exceed £50B annually in the UK (over 2.5% of the entire UK GDP). The beneficiaries could, in the long term, include anyone who uses or depends on probabilistic data, such as those collected on the Web. In the Western world at least, this effectively includes every business/organisation and every individual.

In the shorter term, the techniques for uncertainty and incompleteness handling to be developed in this project will exert a major influence on the theory and practice of probabilistic data management and of ontological data access in a more traditional database and information system context, both within and outside the academic community. Thus, our work will be of benefit to all those working in the broad area of information systems, including researchers - both in academia and industry - in the fields of databases and ontologies. Specifically, in the context of Web data, our work will make it possible to deal with uncertainty and incompleteness in knowledge bases that result from the ontology-based extraction and integration of probabilistic data from the Web. Thus, other short-term beneficiaries will include researchers in both academia and industry working on the extraction and integration of probabilistic data from the Web, as well as on query answering from the resulting knowledge bases.

Unsurprisingly, strong interest in the project is coming from companies working in these areas, like our project partner Wrapidity. One business case that Wrapidity is interested in is understanding and extracting business information from Dark Data, e.g., to analyse investment trends in the region. This requires scalable, expressive, and flexible reasoning, especially over uncertain and incomplete data, which is currently unavailable. Since incompleteness and ontological commonsense handling in probabilistic databases includes as special cases uncertainty and incompleteness handling in single and distributed ontologies, other short-term beneficiaries are researchers and developers dealing with large uncertain and incomplete ontologies.

In the longer term, beneficiaries will be data analysts working with probabilistic data who need to handle incomplete data, as well as researchers and developers who need to exploit probabilistic data in applications. Hence, our work will also be of benefit to researchers in science, industry, governments, and healthcare in such diverse areas as, e.g., biology, medicine, geography, astronomy, agriculture, and aerospace. For example, together with a natural language interface, the realistic probabilistic data models and their scalable query answering techniques to be developed in this project will in the long term pave the way for powerful question answering systems in healthcare or even for medical diagnosis.

The project will also be of wider benefit to the UK's research community by answering important open questions, contributing to the UK's research base, and helping to further cement the UK's world leadership in this research area. In summary, the project will contribute to enhancing the UK's scientific relevance and excellence.

Funded Value: £781,286

Funded Period: Dec 17 - Feb 22

Funder: EPSRC

Project Status: Closed

Project Category: Research Grant

Project Reference: EP/R013667/1

Principal Investigator: Thomas Lukasiewicz

Research Subject: Info. & commun. Technol. (100%)

Research Topic: Artificial Intelligence (10%); Information & Knowledge Mgmt (90%)

People
Thomas Lukasiewicz (Principal Investigator)
Georg Gottlob (Co-Investigator)
Dan Olteanu (Co-Investigator)

Publications

Abboud R (2021) The Surprising Power of Graph Neural Networks with Random Node Initialization

Abboud R (2020) Learning to Reason: Leveraging Neural Networks for Approximate DNF Counting in Proceedings of the AAAI Conference on Artificial Intelligence

Abboud R (2019) Learning to Reason: Leveraging Neural Networks for Approximate DNF Counting

Abboud R (2022) Approximate weighted model integration on DNF structures in Artificial Intelligence

Abboud R (2020) The Surprising Power of Graph Neural Networks with Random Node Initialization

Abboud R. (2020) Learning to reason: Leveraging neural networks for approximate dnf counting in AAAI 2020 - 34th AAAI Conference on Artificial Intelligence

Amador-Domínguez E (2021) An ontology-based deep learning approach for triple classification with out-of-knowledge-base entities in Information Sciences

Amarilli A (2022) The Dichotomy of Evaluating Homomorphism-Closed Queries on Probabilistic Graphs in Logical Methods in Computer Science

Amarilli A. (2020) A dichotomy for homomorphism-closed queries on probabilistic graphs in Leibniz International Proceedings in Informatics, LIPIcs

Anelli V (2020) Combining RDF and SPARQL with CP-theories to reason about preferences in a Linked Data setting in Semantic Web

Key Findings
Collaboration


Description	Large-scale knowledge bases are at the heart of modern information systems. Their knowledge is inherently uncertain, and hence they are often materialized as probabilistic databases. However, probabilistic database management systems typically lack the capability to incorporate implicit background knowledge and, consequently, fail to capture some intuitive query answers. Ontology-mediated query answering is a popular paradigm for encoding commonsense knowledge, which can provide more complete answers to user queries. In the AAAI 2019 paper "Ontology-mediated query answering over log-linear probabilistic data", we propose a new data model that integrates the paradigm of ontology-mediated query answering with probabilistic databases, employing a log-linear probability model. We compare our approach to existing proposals, and provide supporting computational results.
Weighted model counting (WMC) has emerged as a prevalent approach for probabilistic inference. In its most general form, WMC is #P-hard. Weighted DNF counting (weighted #DNF) is a special case, where approximations with probabilistic guarantees are obtained in O(nm), where n denotes the number of variables, and m the number of clauses of the input DNF, but this is not scalable in practice. In the AAAI 2020 paper "Learning to reason: Leveraging neural networks for approximate DNF counting", we propose a neural model counting approach for weighted #DNF that combines approximate model counting with deep learning, and accurately approximates model counts in linear time when width is bounded. We conduct experiments to validate our method, and show that our model learns and generalizes very well to large-scale #DNF instances.
Large-scale probabilistic knowledge bases are becoming increasingly important in academia and industry. They are continuously extended with new data, powered by modern information extraction tools that associate probabilities with knowledge base facts.
The state of the art for storing and processing such data is founded on probabilistic databases. Many systems based on probabilistic databases, however, still have certain semantic deficiencies, which limit their potential applications. In the AIJ 2021 paper "Open-world probabilistic databases: Semantics, algorithms, complexity", we revisit the semantics of probabilistic databases, and argue that the closed-world assumption of probabilistic databases, i.e., the assumption that facts not appearing in the database have the probability zero, conflicts with the everyday use of large-scale probabilistic knowledge bases. To address this discrepancy, we propose open-world probabilistic databases, as a new probabilistic data model. In this new data model, unknown facts, also called open facts, can be assigned any probability value from a default probability interval. Our analysis entails that our model aligns better with many real-world tasks such as query answering, relational learning, knowledge base completion, and rule mining. We make various technical contributions. We show that the data complexity dichotomy, between polynomial time and #P, for evaluating unions of conjunctive queries on probabilistic databases can be lifted to our open-world model. This result is supported by an algorithm that computes the probabilities of the so-called safe queries efficiently. Based on this algorithm, we prove that evaluating safe queries is in linear time for probabilistic databases, under reasonable assumptions. This remains true in open-world probabilistic databases for a more restricted class of safe queries. We extend our data complexity analysis beyond unions of conjunctive queries, and obtain a host of complexity results for both classical and open-world probabilistic databases. We conclude our analysis with an in-depth investigation of the combined complexity in the respective models.
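
As an illustration of the open-world idea described above, the following Python sketch bounds the probability of a simple existential query when open facts may take any probability in a default interval. It assumes the interval is [0, lam] and that the query is monotone (here: "at least one fact over a finite domain holds"), so the bounds are reached by setting every open fact to 0 and to lam respectively. The function name `query_interval`, the toy data, and the value of `lam` are hypothetical and are not taken from the paper.

```python
from math import prod

def query_interval(known, domain, lam):
    """Bounds on P(at least one fact over `domain` holds).

    known  : dict mapping a constant to the probability of its (known) fact
    domain : finite collection of constants the query ranges over
    lam    : assumed upper end of the default interval [0, lam] for open facts
    """
    # Lower bound: every open fact takes probability 0 (the closed-world value).
    lower = 1.0 - prod(1.0 - known.get(c, 0.0) for c in domain)
    # Upper bound: every open fact takes the largest allowed probability lam.
    upper = 1.0 - prod(1.0 - known.get(c, lam) for c in domain)
    return lower, upper

# Hypothetical example: one extracted fact, two open facts.
known = {"movie1": 0.5}
domain = ["movie1", "movie2", "movie3"]
lower, upper = query_interval(known, domain, lam=0.1)
```

Under these assumptions the closed-world answer is the lower end of the returned interval, and the gap between the two ends reflects how much the unknown facts could matter to the query.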
Graph neural networks (GNNs) are effective models for representation learning on relational data. However, standard GNNs are limited in their expressive power, as they cannot distinguish graphs beyond the capability of the Weisfeiler-Leman graph isomorphism heuristic. In order to break this expressiveness barrier, GNNs have been enhanced with random node initialization (RNI), where the idea is to train and run the models with randomized initial node features. In the IJCAI 2021 paper "The surprising power of graph neural networks with random node initialization", we analyze the expressive power of GNNs with RNI, and prove that these models are universal, a first such result for GNNs not relying on computationally demanding higher-order properties. This universality result holds even with partially randomized initial node features, and preserves the invariance properties of GNNs in expectation. We then empirically analyze the effect of RNI on GNNs, based on carefully constructed datasets. Our empirical findings support the superior performance of GNNs with RNI over standard GNNs.
Visual question answering (VQA) is a challenging problem in machine perception, which requires a deep joint understanding of both visual and textual data. Recent research has advanced the automatic generation of high-quality scene graphs from images, while powerful yet elegant models like graph neural networks (GNNs) have shown great power in reasoning over graph-structured data. In the CIKM 2021 paper "Lightweight visual question answering using scene graphs", we propose to bridge the gap between scene graph generation and VQA by leveraging GNNs. In particular, we design a new model called Conditional Enhanced Graph ATtention network (CE-GAT) to encode pairs of visual and semantic scene graphs with both node and edge features, which is seamlessly integrated with a textual question encoder to generate answers through question-graph conditioning.
Moreover, to alleviate the training difficulties of CE-GAT towards VQA, we enforce more useful inductive biases in the scene graphs through novel question-guided graph enriching and pruning.
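
For readers unfamiliar with the weighted #DNF problem mentioned earlier in this description, the following Python sketch shows the classical Karp-Luby style Monte Carlo estimator, which is the kind of sampling-based approximation with probabilistic guarantees that the neural approach is compared against. It is not the neural method of the AAAI 2020 paper. The input encoding (each clause as a dict from variable name to required truth value, and `p` as a dict of independent probabilities of each variable being true), the function names, and the sample count are assumptions made for illustration.

```python
import random
from math import prod

def clause_weight(clause, p):
    """Probability that a random assignment satisfies every literal of the clause."""
    return prod(p[v] if positive else 1.0 - p[v] for v, positive in clause.items())

def satisfies(assignment, clause):
    """True if the assignment agrees with every literal of the clause."""
    return all(assignment[v] == positive for v, positive in clause.items())

def karp_luby(clauses, p, samples=20000, seed=0):
    """Estimate the probability that a DNF formula is true (Karp-Luby style)."""
    rng = random.Random(seed)
    weights = [clause_weight(c, p) for c in clauses]
    total = sum(weights)
    if total == 0.0:
        return 0.0
    hits = 0
    for _ in range(samples):
        # Pick a clause with probability proportional to its weight.
        j = rng.choices(range(len(clauses)), weights=weights)[0]
        # Sample an assignment conditioned on that clause being satisfied.
        assignment = {v: rng.random() < p[v] for v in p}
        assignment.update(clauses[j])
        # Count the sample only if j is the first clause the assignment satisfies,
        # so that each satisfying assignment is counted exactly once.
        first = next(i for i, c in enumerate(clauses) if satisfies(assignment, c))
        hits += (first == j)
    return total * hits / samples

# Hypothetical example: (x) OR (y) with P(x) = P(y) = 0.5; the exact value is 0.75.
p = {"x": 0.5, "y": 0.5}
estimate = karp_luby([{"x": True}, {"y": True}], p)
```

Each sample in this sketch scans the clauses and their literals, which is where the dependence on both the number of variables and the number of clauses comes from.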
Exploitation Route	Some of the findings could be integrated into commercial database engines for enhanced uncertain data management in applications such as healthcare, finance, or IoT. Some findings could also be used for scalable probabilistic inference in applications such as fraud detection, risk assessment, or recommendation systems requiring efficient approximation, and to improve industrial knowledge graphs (e.g., Google Knowledge Graph) by handling incomplete data and default probabilities for missing facts. Furthermore, some findings could be used for GNN libraries to boost expressiveness in drug discovery, social network analysis, or fraud detection, or in robotics, autonomous vehicles, or accessibility tools (e.g., scene understanding for visually impaired users). Overall, the findings democratize robust uncertainty-aware reasoning, enabling industries to handle incomplete/noisy data while maintaining scalability and interpretability.
Sectors	Digital/Communication/Information Technologies (including Software); Education; Healthcare


Description	Collaboration with Antoine Amarilli
Organisation	Telecom Paris
Country	France
Sector	Academic/University
PI Contribution	Joint research towards the ICDT 2020 paper "A Dichotomy for Homomorphism-Closed Queries on Probabilistic Graphs".
Collaborator Contribution	Joint research towards the ICDT 2020 paper "A Dichotomy for Homomorphism-Closed Queries on Probabilistic Graphs".
Impact	ICDT 2020 paper "A Dichotomy for Homomorphism-Closed Queries on Probabilistic Graphs".
Start Year	2019


Description	Collaboration with Phokion Kolaitis
Organisation	IBM Research - Almaden
Country	United States
Sector	Private
PI Contribution	Collaboration has just started.
Collaborator Contribution	Collaboration has just started.
Impact	Collaboration has just started.
Start Year	2021
