Re: ZStandard (with dictionaries) compression support for TOAST compression - Mailing list pgsql-hackers
From: Nikhil Kumar Veldanda
Subject: Re: ZStandard (with dictionaries) compression support for TOAST compression
Msg-id: CAFAfj_FbdeabZEqU_vCCLar07DvF0SJRgimW8A7ZmD=UPwE6VA@mail.gmail.com
In response to: Re: ZStandard (with dictionaries) compression support for TOAST compression (Nikhil Kumar Veldanda <veldanda.nikhilkumar17@gmail.com>)
List: pgsql-hackers
Hi all,

Attached is an updated version of the patch. Specifically, I've removed the changes to the TOAST pointer structure. This proposal differs from earlier discussions on this topic [1], where extending the TOAST pointer was considered essential for enabling dictionary-based compression.

Key improvements introduced in this proposal:

1. No changes to the TOAST pointer: the existing TOAST pointer structure remains untouched, simplifying integration and minimizing potential disruption.

2. Extensible design: the solution is structured to seamlessly incorporate future compression algorithms beyond zstd [2], providing greater flexibility and future-proofing.

3. Inline data compression with dictionary support: crucially, this approach supports dictionary-based compression for inline data. Dictionaries are highly effective for compressing small documents and can provide substantial storage savings; please refer to the attached image from the zstd README [2] for supporting evidence. Omitting dictionary-based compression for inline data would forfeit much of this benefit. For example, under the previous design constraints [3], if a 16KB document compressed down to 256 bytes using a dictionary, storing it inline would not have been feasible. The current proposal removes that limitation, thereby fully leveraging dictionary-based compression.

I believe this solution effectively addresses the limitations identified in our earlier discussions [1][3]. I welcome any feedback or suggestions you might have.
References:

[1] https://www.postgresql.org/message-id/flat/CAJ7c6TPSN06C%2B5cYSkyLkQbwN1C%2BpUNGmx%2BVoGCA-SPLCszC8w%40mail.gmail.com
[2] https://github.com/facebook/zstd
[3] https://www.postgresql.org/message-id/CAJ7c6TPSN06C%2B5cYSkyLkQbwN1C%2BpUNGmx%2BVoGCA-SPLCszC8w%40mail.gmail.com

```
typedef union
{
    struct                      /* Normal varlena (4-byte length) */
    {
        uint32      va_header;
        char        va_data[FLEXIBLE_ARRAY_MEMBER];
    }           va_4byte;
    struct                      /* Compressed-in-line format */
    {
        uint32      va_header;
        uint32      va_tcinfo;  /* Original data size (excludes header) and
                                 * compression method; see va_extinfo */
        char        va_data[FLEXIBLE_ARRAY_MEMBER]; /* Compressed data */
    }           va_compressed;
    struct                      /* Extended compression format */
    {
        uint32      va_header;
        uint32      va_tcinfo;
        uint32      va_cmp_alg;
        uint32      va_cmp_dictid;
        char        va_data[FLEXIBLE_ARRAY_MEMBER];
    }           va_compressed_ext;
} varattrib_4b;
```

The compression algorithm identifier and dictionary id are stored in varattrib_4b.

Best regards,
Nikhil Veldanda

On Fri, Mar 7, 2025 at 5:35 PM Nikhil Kumar Veldanda
<veldanda.nikhilkumar17@gmail.com> wrote:
>
> Hi,
>
> I reviewed the discussions, and while most agreements focused on
> changes to the TOAST pointer, the design I propose requires no
> modifications to it. I've carefully considered the design choices made
> previously. Given zstd's clear advantages in compression efficiency
> and performance over algorithms like PGLZ and LZ4, we can integrate it
> without altering the existing TOAST pointer (varatt_external)
> structure.
>
> By setting the top two bits of the va_extinfo field in varatt_external
> to '11', we can signal an alternative compression algorithm, clearly
> distinguishing new methods from legacy ones. The specific algorithm
> used is then recorded in the va_cmp_alg field.
>
> This approach addresses the issues raised in the summarized thread [1]
> and lets us leverage dictionaries for data that can stay inline.
> While my initial patch includes modifications to the TOAST pointer due
> to a single dependency (pg_column_compression), those changes aren't
> strictly necessary; resolving that dependency separately would make
> the overall design even less intrusive.
>
> Here's an illustrative structure:
>
> ```
> typedef union
> {
>     struct                      /* Normal varlena (4-byte length) */
>     {
>         uint32      va_header;
>         char        va_data[FLEXIBLE_ARRAY_MEMBER];
>     }           va_4byte;
>     struct                      /* Current compressed format */
>     {
>         uint32      va_header;
>         uint32      va_tcinfo;  /* Original size and compression method */
>         char        va_data[FLEXIBLE_ARRAY_MEMBER]; /* Compressed data */
>     }           va_compressed;
>     struct                      /* Extended compression format */
>     {
>         uint32      va_header;
>         uint32      va_tcinfo;
>         uint32      va_cmp_alg;
>         uint32      va_cmp_dictid;
>         char        va_data[FLEXIBLE_ARRAY_MEMBER];
>     }           va_compressed_ext;
> } varattrib_4b;
>
> typedef struct varatt_external
> {
>     int32       va_rawsize;     /* Original data size (includes header) */
>     uint32      va_extinfo;     /* External saved size (without header) and
>                                  * compression method; top two bits '11'
>                                  * indicate a new compression method */
>     Oid         va_valueid;     /* Unique ID of value within TOAST table */
>     Oid         va_toastrelid;  /* RelID of TOAST table containing it */
> } varatt_external;
> ```
>
> The decompression flow remains straightforward: once a datum is
> identified as external, we detoast it, then identify the compression
> algorithm using the TOAST_COMPRESS_METHOD macro, which operates on a
> varattrib_4b structure rather than a TOAST pointer. We retrieve the
> compression algorithm from either va_tcinfo or va_cmp_alg based on the
> adjusted macros, and decompress accordingly.
>
> In summary, integrating Zstandard into the TOAST framework in this
> minimally invasive way should yield substantial benefits.
>
> [1] https://www.postgresql.org/message-id/CAJ7c6TPSN06C%2B5cYSkyLkQbwN1C%2BpUNGmx%2BVoGCA-SPLCszC8w%40mail.gmail.com
>
> Best regards,
> Nikhil Veldanda
>
> On Fri, Mar 7, 2025 at 3:42 AM Aleksander Alekseev
> <aleksander@timescale.com> wrote:
> >
> > Hi Nikhil,
> >
> > > Thank you for highlighting the previous discussion; I reviewed [1]
> > > closely. While both methods involve dictionary-based compression, the
> > > approach I'm proposing differs significantly.
> > >
> > > The previous method explicitly extracted string values from JSONB and
> > > assigned unique OIDs to each entry, resulting in distinct dictionary
> > > entries for every unique value. In contrast, this approach directly
> > > leverages Zstandard's dictionary training API. We provide raw data
> > > samples to zstd, which generates a dictionary of a specified size.
> > > This dictionary is then stored in a catalog table and used to compress
> > > subsequent inserts for the specific attribute it was trained on.
> > >
> > > [...]
> >
> > You didn't read closely enough, I'm afraid. As Tom pointed out, the
> > title of the thread is misleading. On top of that, there are several
> > separate threads. I did my best to cross-reference them, but
> > apparently didn't do a good enough job.
> >
> > Initially I proposed to add the ZSON extension [1][2] to the
> > PostgreSQL core. However, the idea evolved into TOAST improvements
> > that don't require a user to use special types. You may also find the
> > related "Pluggable TOASTer" discussion [3] interesting. The idea there
> > was rather different, but the discussion about extending TOAST
> > pointers so that in the future we can use something other than ZSTD
> > is relevant.
> >
> > You will find the recent summary of the reached agreements somewhere
> > around this message [4]; take a look at the thread a bit above and
> > below it.
> >
> > I believe this effort is important.
> > You can't, however, simply discard everything that was discussed in
> > this area for the past several years, not if you want to succeed. No
> > one will look at your patch if it doesn't account for all the
> > previous discussions. I'm sorry, I know it's disappointing. That
> > being said, you should have done better research before submitting
> > the code. You could have just asked if anyone was working on
> > something like this before and saved a lot of time.
> >
> > Personally, I would suggest starting with one little step toward
> > compression dictionaries, particularly focusing on the extensibility
> > of TOAST pointers. You are going to need to store dictionary ids
> > there and allow using other compression algorithms in the future.
> > This will require something like a varint/utf8-like bitmask. See the
> > previous discussions.
> >
> > [1]: https://github.com/afiskon/zson
> > [2]: https://postgr.es/m/CAJ7c6TP3fCC9TNKJBQAcEf4c%3DL7XQZ7QvuUayLgjhNQMD_5M_A%40mail.gmail.com
> > [3]: https://postgr.es/m/224711f9-83b7-a307-b17f-4457ab73aa0a%40sigaev.ru
> > [4]: https://postgr.es/m/CAJ7c6TPSN06C%2B5cYSkyLkQbwN1C%2BpUNGmx%2BVoGCA-SPLCszC8w%40mail.gmail.com
> >
> > --
> > Best regards,
> > Aleksander Alekseev