Foundations of the Finite Model Theory of Concatenation

Lead Research Organisation: Loughborough University
Department Name: Computer Science

Abstract

The topic of this proposal is the finite model theory of concatenation (FC), a new logic that combines approaches from two different subfields of theoretical computer science and that has direct applications in information extraction and database theory, and eventual applications in all areas that deal with searching in or filtering of textual data.

Textual data is everywhere in modern life: No matter whether we are dealing with books, report, emails, social media, spreadsheets, or even HTML documents or log files, a huge amount of information is represented as a sequence of letters.

Information extraction (IE) deals with the problem of extracting relevant information from such data. One model for this are document spanners, a relational framework for IE that can be understood as querying text like one would query a database. Due to their similarity to relational queries, document spanners have recently become a trending topic in database theory.

Database theory uses methods from finite model theory (FMT), a sub-discipline of logic in computer science. FMT provided the inspiration for relational database, and both fields have maintained a close connection over these last fifty years. Of particular importance are methods for efficient evaluation of queries and deep insights of what can and cannot be expressed with certain types of queries. Every result in FMT is also a result on databases, as there is an immediate one-to-one correspondence between logical formulas and database queries: Every database query corresponds to a logical formula, and every logical formula corresponds to a database query.

This is well-understood for relational databases, but it does directly translate to the textual setting of document spanners: The canonical FMT-approaches to text are provably too weak to model document spanners. But to be able to evaluate, optimize, and compare queries on texts, having this for document spanners would be tremendously useful.

As first steps in this direction, previous research used "the theory of concatenation", a different logic from combinatorics on words (CoW), another sub-discipline of theoretical computer science, that studies patterns in texts. But this logic corresponds to infinite model theory, not finite model theory, which makes many useful techniques from FMT unavailable.

To address this problem, the proposed new logic FC combines the theory of concatenation from CoW with the finite model approach from FMT. As shown in the preparatory research, FC has exactly the required expressive power of document spanners while still allowing the use of many classical FMT techniques.

Apart from the connection to document spanners, FC can be seen as a natural variant of the logics that have previously been used: From an FMT-point of view, FC can be seen as extending the canonical first-order logic on strings with a string equality operator. From a CoW-point of view, FC is the finite model version of the theory of concatenation. As such, FC is a fundamental framework for using logic on textual data, and it can directly be extended to cover other operations, like length constraints as they are used in string verification.

The preparatory research introduces FC, proves it has the desired one-to-one corresponds to document spanners, and shows that many FMT-techniques can be adapted. But many open questions remain, in particular regarding two of the most important aspects: The efficient evaluation of queries, and the fundamental limitations of the framework -- or, less formally, which queries can be evaluated quickly, which algorithms should be used for this, and what is possible in this querying model?

This project will close this gap by creating a solid theoretical foundation for FC. This will require new proof techniques that at the very least merge those from CoW and FMT, or even go far beyond them. The ultimate goal is that FC will do for querying text what FMT did for databases.

Planned Impact

This project is at the start of a research pipeline that begins with examining FC, a new logic that operates on texts. The results and techniques that are developed in this project will enable FC to provide the same benefits to querying text that first-order logic (as it is used in finite model theory) provides to querying relational databases.

A big advantage of relational databases over their precursors is that their declarative languages (like SQL) allow the user to write their queries in a form that like "Give me all X such that Y" instead of writing an algorithm that iterates over the data in the database.

In fact, the user does not even need to know how the algorithm that evaluates the query works. If queries need to run particularly fast, knowledge of the evaluation algorithm is less useful than knowing how to structure a database and a query. In addition to this, large scale database systems often optimize queries automatically.
For both types of optimization (by hand or automatically), the underlying theory builds on finite model theory.

Currently, there are systems that offer similar queries for text, like IBM's SystemT that provided the inspiration for document spanners, a formal model that directly corresponds to FC. But there are few theoretical insights that offer advice how to structure or optimize queries for these. One particular insight is that approaches that work for relational databases do not directly translate to the textual setting: While one could first store the intermediate results of these queries in a relational database and then run a "normal" relational query, previous research show that these intermediate databases can become unfeasibly large.

In contrast to this, there is a direct connection between FC and document spanners. This means that insights into FC directly convert into advice into how to write algorithms that evaluate queries on texts in the document spanner model.

This project will develop the foundations of FC and lay the groundwork for systems that implement these algorithms - first prototypes, then actual implementations.

Eventually, this will allow the creation of declarative text querying systems that, on the one hand, allow the querying or even rewriting of text but, on the other hand, still offer performance guarantees and can optimize queries.

This will enable users to write text queries that are powerful without needing to learn a full programming language and without requiring deep insights into the underlying algorithms.

Applications include searching large documents (like books or reports), or large collections of small documents (like social media postings), data packets, log files, or semi-structured documents (like XML, HTML, or JSON). Users could run queries like "find all email addresses that occur twice in the file" or "find all pairs of place names that are mentioned together in the same sentence".

Although this impact is far in the future and requires significant effort far beyond this project, the societal and economical benefits of faster and more powerful means of extracting and managing information are self-evident.

Publications

10 25 50
publication icon
Freydenberger D.D. (2021) The theory of concatenation over finite models in Leibniz International Proceedings in Informatics, LIPIcs

publication icon
Freydenberger D.D. (2022) Splitting Spanner Atoms: A Tool for Acyclic Core Spanners in Leibniz International Proceedings in Informatics, LIPIcs

 
Description Cooperation with Liat Peterfreund 
Organisation Laboratoire d'Informatique Gaspard-Monge
Country France 
Sector Public 
PI Contribution Co-developed the logic FC that is the main topic of this award, leading to the paper at ICALP 2021.
Collaborator Contribution Co-developed the logic FC that is the main topic of this award, leading to the paper at ICALP 2021.
Impact Freydenberger DD, Peterfreund L. (ICALP 2021). The Theory of Concatenation over Finite Models.
Start Year 2018
 
Description Cooperation with Liat Peterfreund 
Organisation University of Edinburgh
Department School of Informatics Edinburgh
Country United Kingdom 
Sector Academic/University 
PI Contribution Co-developed the logic FC that is the main topic of this award, leading to the paper at ICALP 2021.
Collaborator Contribution Co-developed the logic FC that is the main topic of this award, leading to the paper at ICALP 2021.
Impact Freydenberger DD, Peterfreund L. (ICALP 2021). The Theory of Concatenation over Finite Models.
Start Year 2018
 
Title Marisa Patel's QueryTextLib 
Description A first prototype implementation of the relational query language that implements FC. Create by Marisa Patel as her MSci project. 
Type Of Technology Webtool/Application 
Year Produced 2022 
Open Source License? Yes  
Impact None yet. This is part of the wider plan to create prototype implementations that will, eventually, lead to a usable project.