Re: ZStandard (with dictionaries) compression support for TOAST compression - Mailing list pgsql-hackers

From Robert Haas
Subject Re: ZStandard (with dictionaries) compression support for TOAST compression
Date
Msg-id CA+TgmobSLjaCQV3WkwePn7E25pS-Kov1WbhQ9Co4y6zqsO4nvA@mail.gmail.com
In response to Re: ZStandard (with dictionaries) compression support for TOAST compression  (Nikhil Kumar Veldanda <veldanda.nikhilkumar17@gmail.com>)
List pgsql-hackers
On Mon, Apr 28, 2025 at 5:32 PM Nikhil Kumar Veldanda
<veldanda.nikhilkumar17@gmail.com> wrote:
> Thanks for raising that question. The idea behind including a 24-bit
> length field alongside the 1-byte algorithm ID is to ensure that each
> compressed datum self-describes its metadata size. This allows any
> compression algorithm to embed variable-length metadata (up to 16 MB)
> without the need for hard-coding header sizes. For instance, an
> algorithm in the future might require different metadata lengths for each
> datum, and a fixed header size table wouldn’t work. By storing the
> length in the header, we maintain a generic and future-proof design. I
> would greatly appreciate any feedback on this design. Thanks!
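
So, concretely, what's being described above amounts to something like
this (struct and field names invented here just for illustration, not
taken from the patch):

typedef struct ExtendedCompressedHeader
{
    uint8       cmp_alg;        /* 1-byte compression algorithm ID */
    uint8       meta_len[3];    /* 24-bit metadata length: up to 2^24
                                 * bytes (16 MB) of per-datum metadata,
                                 * e.g. a dictionary identifier */
    /* meta_len bytes of metadata follow, then the compressed payload */
} ExtendedCompressedHeader;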

I feel like I gave you some feedback on the design already, which was
that it seems like a waste of 3 bytes to me.

Don't get me wrong: I'm quite impressed by the way you're working on
this problem and I hope you stick around and keep working on it and
figure something out. But I don't quite understand the point of this
response: it seems like you're just restating what the design does
without really justifying it. The question here isn't whether a 3-byte
header can describe a length up to 16MB; I think we all know our
powers of two well enough to agree on the answer to that question. The
question is whether it's a good use of 3 bytes, and I don't think it
is.

I did consider the fact that future compression algorithms might want
to use variable-length headers; but I couldn't see a reason why we
shouldn't let each of those compression algorithms decide for
themselves how to encode whatever information they need. If a
compression algorithm needs a variable-length header, then it just
needs to make that header self-describing. Worst case scenario, it can
make the first byte of that variable-length header a length byte, and
then go from there; but it's probably possible to be even smarter and
use less than a full byte. Say for example we store a dictionary ID
that in concept is a 32-bit quantity but we use a variable-length
integer representation for it. It's easy to see that we shouldn't
ever need more than 3 bits to say how many bytes that representation
occupies, so a full length byte is overkill and, in fact, would
undermine the value of a variable-length representation
rather severely. (I suspect it's a bad idea anyway, but it's a worse
idea if you burn a full byte on a length header.)
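
Just to sketch what I mean (this is purely illustrative, with invented
names, not anything from the patch): a compression method could pack
the length indicator into the same byte as the low-order bits of the
ID, something like

static int
encode_dict_id(uint8 *out, uint32 dict_id)
{
    int         extra = 0;

    /* low 5 bits of the ID share the first byte with a 3-bit length */
    out[0] = dict_id & 0x1F;
    dict_id >>= 5;
    while (dict_id != 0)
    {
        out[1 + extra] = dict_id & 0xFF;
        dict_id >>= 8;
        extra++;
    }
    /* a 32-bit ID needs at most 4 continuation bytes, so 3 bits suffice */
    out[0] |= extra << 5;
    return 1 + extra;
}

With something like that, a small dictionary ID costs a single byte in
total, and nothing outside the compression method ever needs to know
how it was encoded.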

But there's an even larger question here too, which is why we're
having some kind of discussion about generalized metadata when the
current project seemingly only requires a 4-byte dictionary OID. If
you have some other use of this space in mind, I don't think you've
told us what it is. If you don't, then I'm not sure why we're
designing around an up-to-16MB variable-length quantity when what we
have before us is a 4-byte fixed-length quantity.

Moreover, even if you do have some (undisclosed) idea about what else
might be stored in this metadata area, why would it be important or
even desirable to have the length of that area represented in some
uniform way across compression methods? There's no obvious need for
any code outside the compression method itself to be able to decompose
the Datum into a metadata portion and a payload portion. After all,
the metadata portion could be anything, so there's no way for anything
but the compression method to interpret it usefully. If we do want to
have outside code be able to ask questions, we could design some kind
of callback interface - e.g. if we end up with multiple compression
methods that store dictionary OIDs and they maybe do it in different
ways, each could provide an
"extract-the-dictionary-OID-from-this-datum" callback and each
compression method can implement that however it likes.
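
Something along these lines, say (purely hypothetical: no such
interface exists today, and the names are made up):

typedef struct CompressionMethodRoutine
{
    /* decompress a complete compressed datum */
    struct varlena *(*decompress) (const struct varlena *value);

    /*
     * Return the dictionary OID embedded in this datum's metadata, or
     * InvalidOid if this method doesn't use dictionaries.  Each method
     * interprets its own metadata layout; callers never have to.
     */
    Oid         (*get_dictionary_oid) (const struct varlena *value);
} CompressionMethodRoutine;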

Maybe you can argue that we will eventually end up with various
compression method callbacks each of which is capable of working on
the metadata, and so then we might want to take an initial slice of a
toasted datum that is just big enough to allow that to work. But that
is pretty hypothetical, and in practice the first chunk of the TOAST
value (~2k) seems like it'd probably work well for most cases.

So, again, if you want us to take seriously the idea of dedicating 3
bytes per Datum to something, you need to give us a really good reason
for so doing. The fact a 24-bit metadata length can describe a
metadata header of up to 2^24 bits isn't a reason, good or bad. It's
just math.

--
Robert Haas
EDB: http://www.enterprisedb.com