Re: ZStandard (with dictionaries) compression support for TOAST compression - Mailing list pgsql-hackers

From: Nikhil Kumar Veldanda
Subject: Re: ZStandard (with dictionaries) compression support for TOAST compression
Msg-id: CAFAfj_F-kx2=5s8fEjWkT_cCLf6ToenkRwBWcado5O8gn0J=0A@mail.gmail.com
In response to: Re: ZStandard (with dictionaries) compression support for TOAST compression (Aleksander Alekseev <aleksander@timescale.com>)
List: pgsql-hackers
Hi,

I reviewed the discussions, and while most of the agreements focused on changes to the TOAST pointer, the design I propose requires no modifications to it. I've carefully considered the design choices made previously, and while I recognize Zstd's clear advantages in compression efficiency and performance over algorithms like PGLZ and LZ4, we can integrate it without altering the existing TOAST pointer (varatt_external) structure.

By simply setting the top two bits of the va_extinfo field to '11' in `varatt_external`, we can signal an alternative compression algorithm, clearly distinguishing new methods from legacy ones. The specific algorithm used is then recorded in the va_cmp_alg field. This approach addresses the issues raised in the summarized thread [1] and lets us leverage dictionaries for data that can stay in-line.

While my initial patch includes modifications to the toast pointer due to a single dependency (pg_column_compression), those changes aren't strictly necessary; resolving that dependency separately would make the overall design even less intrusive.

Here's an illustrative structure:

```
typedef union
{
    struct                      /* Normal varlena (4-byte length) */
    {
        uint32  va_header;
        char    va_data[FLEXIBLE_ARRAY_MEMBER];
    }           va_4byte;
    struct                      /* Current compressed format */
    {
        uint32  va_header;
        uint32  va_tcinfo;      /* Original size and compression method */
        char    va_data[FLEXIBLE_ARRAY_MEMBER];     /* Compressed data */
    }           va_compressed;
    struct                      /* Extended compression format */
    {
        uint32  va_header;
        uint32  va_tcinfo;
        uint32  va_cmp_alg;
        uint32  va_cmp_dictid;
        char    va_data[FLEXIBLE_ARRAY_MEMBER];
    }           va_compressed_ext;
} varattrib_4b;

typedef struct varatt_external
{
    int32   va_rawsize;     /* Original data size (includes header) */
    uint32  va_extinfo;     /* External saved size (without header) and
                             * compression method; '11' in the top two
                             * bits indicates new compression methods */
    Oid     va_valueid;     /* Unique ID of value within TOAST table */
    Oid     va_toastrelid;  /* RelID of TOAST table containing it */
} varatt_external;
```

Decompression flow remains straightforward: once a datum is identified as external, we detoast it, then identify the compression algorithm using the `TOAST_COMPRESS_METHOD` macro, which refers to a varattrib_4b structure, not a toast pointer. We retrieve the compression algorithm from either va_tcinfo or va_cmp_alg based on adjusted macros, and decompress accordingly.

In summary, integrating Zstandard into the TOAST framework in this minimally invasive way should yield substantial benefits.

[1] https://www.postgresql.org/message-id/CAJ7c6TPSN06C%2B5cYSkyLkQbwN1C%2BpUNGmx%2BVoGCA-SPLCszC8w%40mail.gmail.com

Best regards,
Nikhil Veldanda

On Fri, Mar 7, 2025 at 3:42 AM Aleksander Alekseev <aleksander@timescale.com> wrote:
>
> Hi Nikhil,
>
> > Thank you for highlighting the previous discussion—I reviewed [1]
> > closely. While both methods involve dictionary-based compression, the
> > approach I'm proposing differs significantly.
> >
> > The previous method explicitly extracted string values from JSONB and
> > assigned unique OIDs to each entry, resulting in distinct dictionary
> > entries for every unique value. In contrast, this approach directly
> > leverages Zstandard's dictionary training API. We provide raw data
> > samples to Zstd, which generates a dictionary of a specified size.
> > This dictionary is then stored in a catalog table and used to compress
> > subsequent inserts for the specific attribute it was trained on.
> >
> > [...]
>
> You didn't read closely enough I'm afraid. As Tom pointed out, the
> title of the thread is misleading. On top of that there are several
> separate threads. I did my best to cross-reference them, but
> apparently didn't do good enough.
>
> Initially I proposed to add ZSON extension [1][2] to the PostgreSQL
> core.
> However the idea evolved into TOAST improvements that don't
> require a user to use special types. You may also find interesting the
> related "Pluggable TOASTer" discussion [3]. The idea there was rather
> different but the discussion about extending TOAST pointers so that in
> the future we can use something else than ZSTD is relevant.
>
> You will find the recent summary of the reached agreements somewhere
> around this message [4], take a look at the thread a bit above and
> below it.
>
> I believe this effort is important. You can't, however, simply discard
> everything that was discussed in this area for the past several years.
> If you want to succeed of course. No one will look at your patch if it
> doesn't account for all the previous discussions. I'm sorry, I know
> it's disappointing. This being said you should have done better
> research before submitting the code. You could just ask if anyone was
> working on something like this before and save a lot of time.
>
> Personally I would suggest starting with one little step toward
> compression dictionaries. Particularly focusing on extendability of
> TOAST pointers. You are going to need to store dictionary ids there
> and allow using other compression algorithms in the future. This will
> require something like a varint/utf8-like bitmask for this. See the
> previous discussions.
>
> [1]: https://github.com/afiskon/zson
> [2]: https://postgr.es/m/CAJ7c6TP3fCC9TNKJBQAcEf4c%3DL7XQZ7QvuUayLgjhNQMD_5M_A%40mail.gmail.com
> [3]: https://postgr.es/m/224711f9-83b7-a307-b17f-4457ab73aa0a%40sigaev.ru
> [4]: https://postgr.es/m/CAJ7c6TPSN06C%2B5cYSkyLkQbwN1C%2BpUNGmx%2BVoGCA-SPLCszC8w%40mail.gmail.com
>
> --
> Best regards,
> Aleksander Alekseev