Re: ZStandard (with dictionaries) compression support for TOAST compression - Mailing list pgsql-hackers
From: Nikhil Kumar Veldanda
Subject: Re: ZStandard (with dictionaries) compression support for TOAST compression
Msg-id: CAFAfj_FbdeabZEqU_vCCLar07DvF0SJRgimW8A7ZmD=UPwE6VA@mail.gmail.com
In response to: Re: ZStandard (with dictionaries) compression support for TOAST compression (Nikhil Kumar Veldanda <veldanda.nikhilkumar17@gmail.com>)
List: pgsql-hackers
Hi all,

Attached is an updated version of the patch. Specifically, I've removed the changes to the TOAST pointer structure. This proposal differs from earlier discussions on this topic [1], where extending the TOAST pointer was considered essential for enabling dictionary-based compression.

Key improvements introduced in this proposal:

1. No changes to the TOAST pointer: the existing TOAST pointer structure remains untouched, simplifying integration and minimizing potential disruption.

2. Extensible design: the solution is structured to seamlessly incorporate future compression algorithms beyond zstd [2], providing greater flexibility and future-proofing.

3. Inline data compression with dictionary support: crucially, this approach supports dictionary-based compression for inline data. Dictionaries are highly effective for compressing small documents and can provide substantial storage savings; please refer to the attached image from the zstd README [2] for supporting evidence. Omitting dictionary-based compression for inline data would forfeit much of this benefit. For example, under the previous design constraints [3], if a 16KB document compressed down to 256 bytes using a dictionary, storing it inline would not have been feasible. The current proposal removes that limitation, thereby fully leveraging dictionary-based compression.

I believe this solution effectively addresses the limitations identified in our earlier discussions [1][3]. I welcome any feedback or suggestions you might have.
References:

[1] https://www.postgresql.org/message-id/flat/CAJ7c6TPSN06C%2B5cYSkyLkQbwN1C%2BpUNGmx%2BVoGCA-SPLCszC8w%40mail.gmail.com
[2] https://github.com/facebook/zstd
[3] https://www.postgresql.org/message-id/CAJ7c6TPSN06C%2B5cYSkyLkQbwN1C%2BpUNGmx%2BVoGCA-SPLCszC8w%40mail.gmail.com

```
typedef union
{
    struct                      /* Normal varlena (4-byte length) */
    {
        uint32      va_header;
        char        va_data[FLEXIBLE_ARRAY_MEMBER];
    }           va_4byte;
    struct                      /* Compressed-in-line format */
    {
        uint32      va_header;
        uint32      va_tcinfo;  /* Original data size (excludes header) and
                                 * compression method; see va_extinfo */
        char        va_data[FLEXIBLE_ARRAY_MEMBER]; /* Compressed data */
    }           va_compressed;
    struct                      /* Extended compression format */
    {
        uint32      va_header;
        uint32      va_tcinfo;
        uint32      va_cmp_alg;
        uint32      va_cmp_dictid;
        char        va_data[FLEXIBLE_ARRAY_MEMBER];
    }           va_compressed_ext;
} varattrib_4b;
```

The compression algorithm identifier and dictionary id are stored in varattrib_4b.

Best regards,
Nikhil Veldanda

On Fri, Mar 7, 2025 at 5:35 PM Nikhil Kumar Veldanda
<veldanda.nikhilkumar17@gmail.com> wrote:
>
> Hi,
>
> I reviewed the discussions, and while most agreements focused on
> changes to the TOAST pointer, the design I propose requires no
> modifications to it. I've carefully considered the design choices made
> previously. Given zstd's clear advantages in compression efficiency
> and performance over algorithms like PGLZ and LZ4, we can integrate it
> without altering the existing TOAST pointer (varatt_external)
> structure.
>
> By setting the top two bits of the va_extinfo field in varatt_external
> to '11', we can signal an alternative compression algorithm, clearly
> distinguishing new methods from legacy ones. The specific algorithm
> used is then recorded in the va_cmp_alg field.
>
> This approach addresses the issues raised in the summarized thread [1]
> and lets us leverage dictionaries for data that can stay inline.
> While my initial patch includes modifications to the TOAST pointer due
> to a single dependency (pg_column_compression), those changes aren't
> strictly necessary; resolving that dependency separately would make
> the overall design even less intrusive.
>
> Here's an illustrative structure:
>
> ```
> typedef union
> {
>     struct                      /* Normal varlena (4-byte length) */
>     {
>         uint32      va_header;
>         char        va_data[FLEXIBLE_ARRAY_MEMBER];
>     }           va_4byte;
>     struct                      /* Current compressed format */
>     {
>         uint32      va_header;
>         uint32      va_tcinfo;  /* Original size and compression method */
>         char        va_data[FLEXIBLE_ARRAY_MEMBER]; /* Compressed data */
>     }           va_compressed;
>     struct                      /* Extended compression format */
>     {
>         uint32      va_header;
>         uint32      va_tcinfo;
>         uint32      va_cmp_alg;
>         uint32      va_cmp_dictid;
>         char        va_data[FLEXIBLE_ARRAY_MEMBER];
>     }           va_compressed_ext;
> } varattrib_4b;
>
> typedef struct varatt_external
> {
>     int32       va_rawsize;     /* Original data size (includes header) */
>     uint32      va_extinfo;     /* External saved size (without header) and
>                                  * compression method; top two bits '11'
>                                  * indicate a new compression method */
>     Oid         va_valueid;     /* Unique ID of value within TOAST table */
>     Oid         va_toastrelid;  /* RelID of TOAST table containing it */
> } varatt_external;
> ```
>
> The decompression flow remains straightforward: once a datum is
> identified as external, we detoast it, then identify the compression
> algorithm using the TOAST_COMPRESS_METHOD macro, which operates on a
> varattrib_4b structure rather than a TOAST pointer. We retrieve the
> compression algorithm from either va_tcinfo or va_cmp_alg based on the
> adjusted macros, and decompress accordingly.
>
> In summary, integrating Zstandard into the TOAST framework in this
> minimally invasive way should yield substantial benefits.
>
> [1] https://www.postgresql.org/message-id/CAJ7c6TPSN06C%2B5cYSkyLkQbwN1C%2BpUNGmx%2BVoGCA-SPLCszC8w%40mail.gmail.com
>
> Best regards,
> Nikhil Veldanda
>
> On Fri, Mar 7, 2025 at 3:42 AM Aleksander Alekseev
> <aleksander@timescale.com> wrote:
> >
> > Hi Nikhil,
> >
> > > Thank you for highlighting the previous discussion; I reviewed [1]
> > > closely. While both methods involve dictionary-based compression, the
> > > approach I'm proposing differs significantly.
> > >
> > > The previous method explicitly extracted string values from JSONB and
> > > assigned unique OIDs to each entry, resulting in distinct dictionary
> > > entries for every unique value. In contrast, this approach directly
> > > leverages Zstandard's dictionary training API. We provide raw data
> > > samples to zstd, which generates a dictionary of a specified size.
> > > This dictionary is then stored in a catalog table and used to compress
> > > subsequent inserts for the specific attribute it was trained on.
> > >
> > > [...]
> >
> > You didn't read closely enough, I'm afraid. As Tom pointed out, the
> > title of the thread is misleading. On top of that, there are several
> > separate threads. I did my best to cross-reference them, but
> > apparently didn't do a good enough job.
> >
> > Initially I proposed to add the ZSON extension [1][2] to the
> > PostgreSQL core. However, the idea evolved into TOAST improvements
> > that don't require a user to use special types. You may also find the
> > related "Pluggable TOASTer" discussion [3] interesting. The idea there
> > was rather different, but the discussion about extending TOAST
> > pointers so that in the future we can use something other than ZSTD
> > is relevant.
> >
> > You will find the recent summary of the reached agreements somewhere
> > around this message [4]; take a look at the thread a bit above and
> > below it.
> >
> > I believe this effort is important.
> > You can't, however, simply discard everything that was discussed in
> > this area for the past several years, not if you want to succeed. No
> > one will look at your patch if it doesn't account for all the
> > previous discussions. I'm sorry, I know it's disappointing. That
> > being said, you should have done better research before submitting
> > the code. You could have just asked if anyone was working on
> > something like this before and saved a lot of time.
> >
> > Personally, I would suggest starting with one little step toward
> > compression dictionaries, particularly focusing on the extensibility
> > of TOAST pointers. You are going to need to store dictionary ids
> > there and allow using other compression algorithms in the future.
> > This will require something like a varint/utf8-like bitmask. See the
> > previous discussions.
> >
> > [1]: https://github.com/afiskon/zson
> > [2]: https://postgr.es/m/CAJ7c6TP3fCC9TNKJBQAcEf4c%3DL7XQZ7QvuUayLgjhNQMD_5M_A%40mail.gmail.com
> > [3]: https://postgr.es/m/224711f9-83b7-a307-b17f-4457ab73aa0a%40sigaev.ru
> > [4]: https://postgr.es/m/CAJ7c6TPSN06C%2B5cYSkyLkQbwN1C%2BpUNGmx%2BVoGCA-SPLCszC8w%40mail.gmail.com
> >
> > --
> > Best regards,
> > Aleksander Alekseev