Transparent Compression for General-Purpose Programming Languages
Lead Research Organisation:
Imperial College London
Department Name: Computing
Abstract
Lossless compression is a key optimisation technique when storing, transmitting or processing datasets of any size and kind. Compression allows managing datasets larger than the primary storage medium and reduces the bandwidth needed for data access in applications ranging from physics simulations to machine learning models to sensor data management. Some compression schemes allow computation to be performed directly on compressed data, often reducing computational complexity compared to uncompressed data - such schemes are commonly called "lightweight". For example, summing values in a Run-Length-Encoded (RLE) array requires effort on the order of the size of the compressed input, i.e., usually significantly less than the uncompressed input. Similarly, relational selections and joins can be prefiltered on the dictionary of a dictionary-compressed relation or the minimum value in a frame-of-reference-encoded column.
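To make the RLE example concrete, the following sketch (in Python; the helper names and the pair-based run representation are our own illustrative choices, not part of the project) sums an array by iterating over its runs rather than its decompressed elements:

    # Illustrative sketch only: an RLE array stored as [value, run_length] pairs.
    def rle_encode(values):
        runs = []
        for v in values:
            if runs and runs[-1][0] == v:
                runs[-1][1] += 1
            else:
                runs.append([v, 1])
        return runs

    def rle_sum(runs):
        # One multiply-add per run: cost is proportional to the compressed size,
        # not to the number of decompressed elements.
        return sum(value * length for value, length in runs)

    data = [7, 7, 7, 7, 0, 0, 3, 3, 3]
    runs = rle_encode(data)            # [[7, 4], [0, 2], [3, 3]]
    assert rle_sum(runs) == sum(data)  # 37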
Implementing algorithms to work directly on compressed data is challenging: the computation has to be tightly integrated with the (de-)compression code, and inefficiencies in the implementation can easily outweigh the benefits in data transfer or processing performance. Consequently, most lightweight compression schemes are bespoke, i.e., developed and tuned for specific, well-understood domains such as relational databases, image processing or linear algebra. While the state of the art is to implement them manually, virtually all lightweight compression schemes can be expressed as sequences of primitive transformations such as Run-Length-Encoding (RLE), dictionary compression or Huffman coding. Examples of such "compression pipelines" are PFOR-delta (in the Actian Vector DBMS), VertiPaq (in Microsoft SQL Server) or the Compressed-Sparse-Row matrix representation (in many linear algebra packages).
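As a rough illustration of how such a pipeline composes from primitives, the sketch below chains delta encoding into the rle_encode helper from the previous sketch. It is a deliberately simplified stand-in for schemes like PFOR-delta, not their actual implementation:

    # Illustrative sketch: a two-stage "compression pipeline" built from primitives.
    def delta_encode(values):
        prev, out = 0, []
        for v in values:
            out.append(v - prev)   # store differences between consecutive values
            prev = v
        return out

    def compress(values):
        # Each primitive consumes the output of the previous one.
        return rle_encode(delta_encode(values))

    timestamps = [100, 101, 102, 103, 103, 103, 110]
    # delta_encode -> [100, 1, 1, 1, 0, 0, 7]
    # rle_encode   -> [[100, 1], [1, 3], [0, 2], [7, 1]]
    print(compress(timestamps))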
However, there are three fundamental problems with the state of the art: first, there is no principled way to assemble these pipelines. Second, these schemes are tied to a specific data/processing model (relational algebra, linear algebra, etc.). Third, and most importantly, the implementation effort is high, as every application needs to implement compression from scratch. Unsurprisingly, many applications that could benefit from compression shy away from that effort: in particular for domain scientists writing code in languages like Python or R, low implementation effort takes precedence over efficiency.
Our vision is to make the benefits of performance-oriented compression available to applications beyond the few mentioned above. For that purpose, we will develop an algebraic framework for the representation and optimisation of bespoke compression schemes in general-purpose programming languages. Instead of "weaving" hundreds of lines of compression-related code into an application's logic, developers will express compression schemes as annotations on collections. The backend transparently transforms code that operates on the collection to take advantage of the compression strategy. This allows even non-experts to implement bespoke compression schemes. Simplifying the interface even further, we will implement a fully automated approach that determines the most appropriate compression scheme for a program, dataset and hardware platform using cost-based optimisation, rather than requiring the scheme to be specified explicitly.
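The following Python sketch illustrates the annotation idea under our own assumptions: the compressed decorator, the RLEVector wrapper and the reuse of rle_encode from the first sketch are hypothetical and are not the project's actual interface.

    # Hypothetical illustration of annotation-driven compression, not an actual API.
    class RLEVector:
        def __init__(self, values):
            self.runs = rle_encode(values)   # primitive from the first sketch
            self.length = len(values)

        def __len__(self):
            return self.length

        def sum(self):
            # Runs directly on the compressed representation.
            return sum(v * n for v, n in self.runs)

    def compressed(fn):
        # Stand-in for the envisioned annotation: wrap the returned collection.
        def wrapper(*args, **kwargs):
            return RLEVector(fn(*args, **kwargs))
        return wrapper

    @compressed
    def sensor_readings():
        return [20, 20, 20, 21, 21, 19]

    readings = sensor_readings()
    print(readings.sum() / len(readings))    # mean computed without decompressing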
Publications
Mohr-Daurat H (2024) BOSS - An Architecture for Database Kernel Composition, in Proceedings of the VLDB Endowment
Mohr-Daurat H (2024) Hardware-Efficient Data Imputation through DBMS Extensibility, in Proceedings of the VLDB Endowment
| Description | We invested significant effort in the development of a transparent compression optimizer inside the Graal compiler framework. While simple optimizations are possible and working, we found that the intricacies of Graal (and low-level IRs like it) make the development of generic transparent compression very hard. Fotis Kounelis is in the process of writing up his dissertation, outlining the technical challenges. We would consider this a negative result. However, we decided to refocus our efforts on the development of a novel Intermediate Representation that addresses these challenges. This led to a more modular approach: a new Intermediate Representation and a runtime for it. This approach was implemented in the BOSS project. While BOSS has a broader focus than "just" compression, it also serves as an IR for that purpose. Although BOSS is still a relatively new project, it has already been recognized with a VLDB best paper award (VLDB is one of the top database conferences, with an A* CORE ranking). |
| Exploitation Route | BOSS is an open-source framework. We are working to build a community around it. |
| Sectors | Digital/Communication/Information Technologies (including Software) |
| URL | http://boss.lsds.uk |
| Title | BOSS |
| Description | BOSS is a framework for the construction of data processing systems. Its focus lies on a composable architecture with high performance. |
| Type Of Technology | Software |
| Year Produced | 2024 |
| Open Source License? | Yes |
| Impact | Modern Data Management Systems increasingly abandon monolithic architectures in favor of compositions of specialized components. Storage layers like Parquet and Arrow are combined with kernels like Velox and RocksDB, optimizers like Apache Calcite or Orca, and other specialized components to build systems optimized for a specific domain, execution environment or even application. Unfortunately, the architecture of data management systems and the interfaces between their components are the same as they were 30 years ago: highly efficient but rigid. This rigidity obstructs the adoption of novel ideas and techniques such as hardware acceleration, adaptive processing, learned optimization, or serverless execution in real-world systems. BOSS addresses this impasse through a novel approach to data management system composition inspired by two principles stemming from compiler-construction research: a homoiconic representation of data and code, and partial evaluation of queries by components (a minimal sketch of these two principles follows this table). BOSS achieves a fully composable design that effectively combines different data models, hardware platforms and processing engines. It implements features like GPU acceleration of relational queries and generative data cleaning without (measurable) overhead compared to a monolithic design. |
| URL | http://boss.lsds.uk |
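As a rough, assumption-laden illustration of the two principles named in the Impact entry above (homoiconic representation and partial evaluation), the following Python sketch is our own simplification and not BOSS code:

    # Illustration only, not BOSS code: a query is held as a plain, inspectable
    # expression (homoiconicity); an engine partially evaluates the parts it
    # understands and leaves the rest for other components.
    def arithmetic_engine(expr):
        if not isinstance(expr, tuple) or not expr:
            return expr
        head, *args = expr
        args = [arithmetic_engine(a) for a in args]
        if head == "Plus" and all(isinstance(a, (int, float)) for a in args):
            return sum(args)        # evaluate what this engine understands
        return (head, *args)        # pass everything else on unchanged

    query = ("Select", ("Column", "price"),
             ("Greater", ("Column", "price"), ("Plus", 10, 5)))
    # Folds ("Plus", 10, 5) to 15; a relational engine would later consume
    # the remaining Select/Greater expression.
    print(arithmetic_engine(query))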
