From: Robert Haas
Subject: Re: ZStandard (with dictionaries) compression support for TOAST compression
Date:
Msg-id: CA+TgmobSLjaCQV3WkwePn7E25pS-Kov1WbhQ9Co4y6zqsO4nvA@mail.gmail.com
In response to: Re: ZStandard (with dictionaries) compression support for TOAST compression (Nikhil Kumar Veldanda <veldanda.nikhilkumar17@gmail.com>)
List: pgsql-hackers
On Mon, Apr 28, 2025 at 5:32 PM Nikhil Kumar Veldanda
<veldanda.nikhilkumar17@gmail.com> wrote:
> Thanks for raising that question. The idea behind including a 24-bit
> length field alongside the 1-byte algorithm ID is to ensure that each
> compressed datum self-describes its metadata size. This allows any
> compression algorithm to embed variable-length metadata (up to 16 MB)
> without the need for hard-coded header sizes. For instance, an
> algorithm in the future might require different metadata lengths for
> each datum, and a fixed header-size table wouldn't work. By storing
> the length in the header, we maintain a generic and future-proof
> design. I would greatly appreciate any feedback on this design.
> Thanks!

I feel like I gave you some feedback on the design already, which was that it seems like a waste of 3 bytes to me. Don't get me wrong: I'm quite impressed by the way you're working on this problem, and I hope you stick around and keep working on it and figure something out. But I don't quite understand the point of this response: it seems like you're just restating what the design does without really justifying it. The question here isn't whether a 3-byte header can describe a length of up to 16MB; I think we all know our powers of two well enough to agree on the answer to that question. The question is whether it's a good use of 3 bytes, and I don't think it is.

I did consider the fact that future compression algorithms might want to use variable-length headers, but I couldn't see a reason why we shouldn't let each of those compression algorithms decide for itself how to encode whatever information it needs. If a compression algorithm needs a variable-length header, then it just needs to make that header self-describing. Worst case, it can make the first byte of that variable-length header a length byte and go from there; but it's probably possible to be even smarter and use less than a full byte. Say, for example, we store a dictionary ID that is conceptually a 32-bit quantity but use a variable-length integer representation for it. It's easy to see that we shouldn't ever need more than 3 bits to describe the length of that representation, so a full length byte is overkill and would, in fact, undermine the value of a variable-length representation rather severely. (I suspect it's a bad idea anyway, but it's a worse idea if you burn a full byte on a length header.)
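To make the arithmetic concrete, here's a rough sketch of the kind of self-describing encoding I mean. This is purely illustrative, not from any patch; the function names are invented, and a real method would presumably fold those 3 length bits in with whatever other flag bits it keeps rather than spend a whole byte on them:

#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative only: encode a 32-bit dictionary ID so that the
 * representation describes its own length.  The low 3 bits of the first
 * byte give the number of ID bytes that follow (0..4); the other 5 bits
 * stay free for the compression method to use however it likes.  ID
 * bytes follow least-significant first.
 */
static size_t
encode_dict_id(uint32_t dict_id, uint8_t *out)
{
    uint8_t     nbytes = 0;

    for (uint32_t tmp = dict_id; tmp != 0; tmp >>= 8)
        nbytes++;                   /* bytes actually needed: 0..4 */

    out[0] = nbytes;                /* fits in 3 bits; 5 bits spare */
    for (uint8_t i = 0; i < nbytes; i++)
        out[1 + i] = (uint8_t) (dict_id >> (8 * i));

    return 1 + nbytes;              /* total space consumed */
}

static size_t
decode_dict_id(const uint8_t *in, uint32_t *dict_id)
{
    uint8_t     nbytes = in[0] & 0x07;  /* the 3-bit length field */

    *dict_id = 0;
    for (uint8_t i = 0; i < nbytes; i++)
        *dict_id |= ((uint32_t) in[1 + i]) << (8 * i);

    return 1 + nbytes;
}

Under a scheme like that, a datum referencing a low-numbered dictionary pays 2 bytes of header, versus a fixed 4-byte OID plus the proposed 3-byte length field, and no separate length byte is ever needed.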
But there's an even larger question here too, which is why we're having some kind of discussion about generalized metadata when the current project seemingly only requires a 4-byte dictionary OID. If you have some other use of this space in mind, I don't think you've told us what it is. If you don't, then I'm not sure why we're designing around an up-to-16MB variable-length quantity when what we have before us is a 4-byte fixed-length quantity. Moreover, even if you do have some (undisclosed) idea about what else might be stored in this metadata area, why would it be important or even desirable to have the length of that area represented in some uniform way across compression methods? There's no obvious need for any code outside the compression method itself to be able to decompose the Datum into a metadata portion and a payload portion. After all, the metadata portion could be anything, so there's no way for anything but the compression method to interpret it usefully. If we do want outside code to be able to ask questions, we could design some kind of callback interface: for example, if we end up with multiple compression methods that store dictionary OIDs, perhaps in different ways, each could provide an "extract-the-dictionary-OID-from-this-datum" callback and implement it however it likes.
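Just to show the shape of that idea (the names below are invented for illustration, not an actual proposal):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Oid;           /* stand-in for PostgreSQL's Oid typedef */

/*
 * Per-compression-method callbacks.  Only the method itself understands
 * its own metadata layout, so only it can implement this.
 */
typedef struct ToastCompressionRoutine
{
    const char *name;

    /*
     * If the compressed datum references a dictionary, set *dict_oid
     * and return true; otherwise return false.
     */
    bool        (*extract_dict_oid) (const void *data, size_t len,
                                     Oid *dict_oid);
} ToastCompressionRoutine;

A zstd method might implement that by decoding a self-describing header like the one sketched above; some future method could store its dictionary reference completely differently, and callers wouldn't need to care.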
Maybe you can argue that we will eventually end up with various compression method callbacks, each of which is capable of working on the metadata, and that we might then want to take an initial slice of a toasted datum that is just big enough to allow that to work. But that is pretty hypothetical, and in practice the first chunk of the TOAST value (~2k) seems like it'd probably work well for most cases.

So, again, if you want us to take seriously the idea of dedicating 3 bytes per Datum to something, you need to give us a really good reason for doing so. The fact that a 24-bit metadata length can describe a metadata header of up to 2^24 bytes isn't a reason, good or bad. It's just math.

--
Robert Haas
EDB: http://www.enterprisedb.com