RealPDBs: Realistic Data Models and Query Compilation for Large-Scale Probabilistic Databases

Lead Research Organisation: University of Oxford
Department Name: Computer Science


In recent years, there has been strong interest in academia and industry in building large-scale probabilistic knowledge bases from data in an automated way, which has resulted in a number of systems, such as DeepDive, NELL, Yago, Freebase, Microsoft's Probase, and Google's Knowledge Vault. These systems continuously crawl the Web and extract structured information, and thus populate their databases with millions of entities and billions of tuples. To what extent can these search and extraction systems help with real-world use cases? This turns out to be an open-ended question. For example, DeepDive is used to build knowledge bases for domains such as paleontology, geology, medical genetics, and human movement. From a broader perspective, the quest for building large-scale knowledge bases marks a new dawn for artificial intelligence research. Fields such as information extraction, natural language processing (e.g., question answering), relational and deep learning, knowledge representation and reasoning, and databases are working towards a common goal. Querying large-scale probabilistic knowledge bases is commonly regarded as being at the heart of these efforts.

Beyond all these success stories, however, probabilistic knowledge bases still lack the fundamental machinery to convey some of the valuable knowledge hidden in them to the end user, which seriously limits their potential applications in practice. These problems are rooted in the semantics of (tuple-independent) probabilistic databases, which are used for encoding most probabilistic knowledge bases. For reasons of computational efficiency, probabilistic databases are typically based on strong, unrealistic completeness assumptions, such as the closed-world assumption, the tuple-independence assumption, and the lack of commonsense knowledge. These strong, unrealistic assumptions not only lead to unwanted consequences, but also put probabilistic databases on a weak footing in terms of knowledge base learning, completion, and querying. More specifically, each of the above systems encodes only a portion of the real world, and this description is necessarily incomplete; these systems continuously crawl the Web, encounter new sources and consequently new facts, and add such facts to their databases. However, when it comes to querying, most of these systems employ the closed-world assumption, i.e., any fact that is not present in the database is assigned the probability 0, and is thus assumed to be impossible. As a closely related problem, it is common practice to view every extracted fact as an independent Bernoulli variable, i.e., any two facts are probabilistically independent. For example, the fact that a person starred in a movie is treated as independent of the fact that this person is an actor, which is in conflict with the fundamental nature of the knowledge domain. Furthermore, current probabilistic databases lack commonsense knowledge (in particular, ontological knowledge), which can often be exploited in reasoning to deduce implicit consequences from data, and which is often essential for querying large-scale probabilistic databases in an uncontrolled environment such as the Web.
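
The two assumptions criticised above can be made concrete with a small sketch. The Python below is purely illustrative and not part of any of the systems named: the facts, their probabilities, and the function names `prob`, `prob_all`, and `prob_all_by_worlds` are all hypothetical. It shows that a fact absent from the database receives probability 0 (closed world), and that the probability of a conjunction of facts is simply the product of their marginals (tuple independence), which a brute-force enumeration of possible worlds confirms.

```python
from itertools import product

# Hypothetical toy database: each extracted fact has a marginal probability.
facts = {
    ("StarredIn", "alice", "some_movie"): 0.9,
    ("Actor", "alice"): 0.6,
}

def prob(fact):
    """Closed-world assumption: a fact absent from the database gets probability 0."""
    return facts.get(fact, 0.0)

def prob_all(query_facts):
    """Probability that all listed facts hold, assuming tuple independence."""
    p = 1.0
    for f in set(query_facts):  # each distinct fact contributes its marginal once
        p *= prob(f)
    return p

def prob_all_by_worlds(query_facts):
    """Same quantity, computed by enumerating every possible world (for checking)."""
    keys = list(facts)
    total = 0.0
    for bits in product([False, True], repeat=len(keys)):
        # Weight of this world: product over facts of p (present) or 1 - p (absent).
        weight = 1.0
        for k, present in zip(keys, bits):
            weight *= facts[k] if present else 1.0 - facts[k]
        world = {k for k, present in zip(keys, bits) if present}
        if all(f in world for f in query_facts):
            total += weight
    return total

starred = ("StarredIn", "alice", "some_movie")
actor = ("Actor", "alice")
joint = prob_all([starred, actor])
# Under independence, knowing that alice starred in a movie does not change
# the probability that she is an actor.
actor_given_starred = joint / prob(starred)
```

In this toy setting, `actor_given_starred` equals the prior probability of the `Actor` fact, so evidence that a person starred in a movie does nothing to raise the belief that the person is an actor. This is the kind of unwanted consequence that more realistic data models are meant to avoid.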

The main goal of this proposal is to enhance large-scale probabilistic databases (and so to unlock their full data modelling potential) with more realistic data models, while preserving their computational properties. We plan to develop different semantics for the resulting probabilistic databases and to analyse their computational properties and sources of intractability. We also plan to design practical, scalable query answering algorithms for them, especially algorithms based on knowledge compilation techniques, extending existing knowledge compilation approaches and elaborating new ones based on tensor factorisation and neural-symbolic knowledge compilation. We will also produce a prototype implementation and experimentally evaluate the proposed algorithms.

Planned Impact

We are proposing to lay the foundations for a new generation of probabilistic database systems that will revolutionise how we deal with probabilistic data, and unlock their full data modelling potential. As a special kind of Big Data, probabilistic data are being produced by an increasing number of applications, devices, and users, and one of their main challenges is how to deal with their incompleteness, most notably in the context of the World Wide Web. The commercial value of probabilistic data management is also reflected by the fact that the company that grew out of the probabilistic database system DeepDive has just been acquired by Apple for $175 million to $200 million. Big Data in general are critically important for a variety of areas in science, industry, government, and healthcare; their economic potential is enormous, estimated to exceed £50B annually in the UK (over 2.5% of the entire UK GDP). The beneficiaries could, in the long term, include anyone who uses or depends on probabilistic data, such as those collected on the Web. In the Western world at least, this effectively includes every business/organisation and every individual.

In the shorter term, the techniques for uncertainty and incompleteness handling to be developed in this project will exert a major influence on the theory and practice of probabilistic data management and of ontological data access in a more traditional database and information system context, both within and outside the academic community. Thus, our work will be of benefit to all those working in the broad area of information systems, including researchers, in both academia and industry, in the fields of databases and ontologies. Specifically, in the context of Web data, our work will make it possible to deal with uncertainty and incompleteness in knowledge bases that result from the ontology-based extraction and integration of probabilistic data from the Web. Thus, other short-term beneficiaries will include researchers in both academia and industry working on the extraction and integration of probabilistic data from the Web, as well as on query answering from the resulting knowledge bases.

Unsurprisingly, strong interest in the project is coming from companies working in these areas, like our project partner Wrapidity. One business case that Wrapidity is interested in is understanding and extracting business information from Dark Data, e.g., to analyse investment trends in the region. This requires scalable, expressive, and flexible reasoning (currently unavailable), especially over uncertain and incomplete data. Since incompleteness and ontological commonsense handling in probabilistic databases includes, as special cases, uncertainty and incompleteness handling in single and distributed ontologies, other short-term beneficiaries are researchers and developers dealing with large uncertain and incomplete ontologies.

In the longer term, beneficiaries will be data analysts working with probabilistic data who need to handle incomplete data, as well as researchers and developers who need to exploit probabilistic data in applications. Hence, our work will also be of benefit to researchers in science, industry, governments, and healthcare in such diverse areas as, e.g., biology, medicine, geography, astronomy, agriculture, and aerospace. For example, together with a natural language interface, the realistic probabilistic data models and their scalable query answering techniques to be developed in this project will in the long term pave the way for powerful question answering systems in healthcare or even for medical diagnosis.

The project will also be of wider benefit to the UK's research community by answering important open questions, contributing to the UK's research base, and helping to further cement the UK's world leadership in this research area. In summary, the project will contribute to enhancing the UK's scientific relevance and excellence.


Lukasiewicz T (2021) Complexity Results for Preference Aggregation over (m)CP-Nets: Max and Rank Voting in Artificial Intelligence

Salvatori T (2022) Reverse Differentiation via Predictive Coding in Proceedings of the AAAI Conference on Artificial Intelligence

Rosati J (2018) Combining RDF and SPARQL with CP-Theories to Reason about Preferences in a Linked Data Setting in Semantic Web - Interoperability, Usability, Applicability

Rezounenko A (2023) Viral Infection Model with Diffusion and Distributed Delay: Finite-Dimensional Global Attractor. in Qualitative theory of dynamical systems

Rao S (2022) An Explainable Transformer-Based Deep Learning Model for the Prediction of Incident Heart Failure. in IEEE journal of biomedical and health informatics

Description Collaboration with Antoine Amarilli 
Organisation Telecom Paris
Country France 
Sector Academic/University 
PI Contribution Joint research towards the ICDT 2020 paper "A Dichotomy for Homomorphism-Closed Queries on Probabilistic Graphs".
Collaborator Contribution Joint research towards the ICDT 2020 paper "A Dichotomy for Homomorphism-Closed Queries on Probabilistic Graphs".
Impact ICDT 2020 paper "A Dichotomy for Homomorphism-Closed Queries on Probabilistic Graphs".
Start Year 2019

Description Collaboration with Phokion Kolaitis
Organisation IBM Research - Almaden
Country United States 
Sector Private 
PI Contribution Collaboration has just started.
Collaborator Contribution Collaboration has just started.
Impact Collaboration has just started.
Start Year 2021