Re: Re: [HACKERS] Custom compression methods - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: Re: [HACKERS] Custom compression methods
Date:
Msg-id: CA+TgmobSDVgUage9qQ5P_=F_9jaMkCgyKxUQGtFQU7oN4kX-AA@mail.gmail.com
In response to: Re: Re: [HACKERS] Custom compression methods (Alexander Korotkov <a.korotkov@postgrespro.ru>)
Responses: Re: [HACKERS] Custom compression methods; Re: Re: [HACKERS] Custom compression methods
List: pgsql-hackers
On Thu, Mar 7, 2019 at 2:51 AM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
> Yes. I took a look at code of this patch. I think it's in pretty good
> shape. But high level review/discussion is required.

I agree that the code of this patch is in pretty good shape, although there is a lot of rebasing needed at this point. Here is an attempt at some high level review and discussion:

- As far as I can see, there is broad agreement that we shouldn't consider ourselves to be locked into 'pglz' forever. I believe numerous people have reported that there are other methods of doing compression that either compress better, or compress faster, or decompress faster, or all of the above. This isn't surprising, nor is it a knock on 'pglz'; Jan committed it in 1999, and it's not surprising that in 20 years some people have come up with better ideas. Not only that, but the quantity and quality of open source software that is available for this kind of thing, and for many other kinds of things, have improved dramatically in that time.

- I can see three possible ways of breaking our dependence on 'pglz' for TOAST compression.

Option #1 is to pick one new algorithm which we think is better than 'pglz' in all relevant ways and use it as the default for all new compressed datums. This would be dramatically simpler than what this patch does, because there would be no user interface. It would just be out with the old and in with the new.

Option #2 is to create a short list of new algorithms that have different trade-offs; e.g. one that is very fast (like lz4) and one that has an extremely high compression ratio, and provide an interface for users to choose between them. This would be moderately simpler than what this patch does, because we would not expose to the user anything about how a new compression method could be added, but it would still require a UI for the user to choose between the available (and hard-coded) options. It has the further advantage that every PostgreSQL cluster will offer the same options (or a subset of them, perhaps, depending on configure flags), and so you don't have to worry that, say, a pg_am row gets lost and suddenly all of your toasted data is inaccessible and uninterpretable.

Option #3 is to do what this patch actually does, which is to allow for the addition of any number of compressors, including by extensions. It has the advantage that new compressors can be added without needing core's permission, so, for example, if it is unclear whether some excellent compressor is free of patent problems, we can elect not to ship support for it in core, while at the same time people who are willing to accept the associated legal risk can add that functionality to their own copy as an extension without having to patch core. The legal climate may even vary by jurisdiction, so what might be questionable in country A might be clearly just fine in country B. Aside from those issues, this approach allows people to experiment and innovate outside of core relatively quickly, instead of being bound by the somewhat cumbrous development process which has left this patch in limbo for the last few years.

My view is that option #1 is likely to be impractical, because getting people to agree is hard, and better things are likely to come along later, and people like options. So I prefer either #2 or #3.
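To make option #3 concrete, here is a rough sketch of the kind of handler an extension would provide, in the spirit of index AM handlers. The struct and function names (CompressionRoutine, my_lz4_compression_handler, my_lz4_compress, my_lz4_decompress) are invented for illustration and are not the patch's actual API; only the standard fmgr macros are real:

    /* Assumes the usual extension boilerplate. */
    #include "postgres.h"
    #include "fmgr.h"

    PG_MODULE_MAGIC;

    /* Hypothetical callback table; not the patch's actual struct. */
    typedef struct CompressionRoutine
    {
        /* compress 'value'; return NULL if compression isn't worthwhile */
        struct varlena *(*datum_compress) (const struct varlena *value);
        /* decompress a datum previously produced by datum_compress */
        struct varlena *(*datum_decompress) (const struct varlena *value);
    } CompressionRoutine;

    /* hypothetical wrappers around liblz4; declarations only, for the sketch */
    extern struct varlena *my_lz4_compress(const struct varlena *value);
    extern struct varlena *my_lz4_decompress(const struct varlena *value);

    /* Hypothetical handler function an extension would register. */
    PG_FUNCTION_INFO_V1(my_lz4_compression_handler);

    Datum
    my_lz4_compression_handler(PG_FUNCTION_ARGS)
    {
        static CompressionRoutine routine = {
            .datum_compress = my_lz4_compress,
            .datum_decompress = my_lz4_decompress,
        };

        PG_RETURN_POINTER(&routine);
    }

Whatever the exact shape ends up being, that callback table is what a lost pg_am row would leave us unable to find again, which is the risk mentioned above for option #3 relative to option #2.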
- The next question is how a datum compressed with some non-default method should be represented on disk. The patch handles this first of all by making the observation that the compressed size can't be >=1GB, because the uncompressed size can't be >=1GB, and we wouldn't have stored it compressed if it expanded. Therefore, the upper two bits of the compressed size should always be zero on disk, and the patch steals one of them to indicate whether "custom" compression is in use. If it is, the 4-byte varlena header is followed not only by a 4-byte size (now with the new flag bit also included) but also by a 4-byte OID, indicating the compression AM in use.

I don't think this is a terrible approach, but I don't think it's amazing, either. 4 bytes is quite a bit to use for this; if I guess correctly what will be a typical cluster configuration, you probably would really only need about 2 bits. For a datum that is both stored externally and compressed, the overhead is likely negligible, because the length is probably measured in kB or MB. But for a datum that is compressed but not stored externally, it seems pretty expensive; the datum is probably short, and having an extra 4 bytes of uncompressible data kinda sucks. One possibility would be to allow only one byte here: require each compression AM that is installed to advertise a one-byte value that will denote its compressed datums. If more than one AM tries to claim the same byte value, complain. Another possibility is to abandon this approach and go with #2 from the previous paragraph. Or maybe we add 1 or 2 "privileged" built-in compressors that get dedicated bit-patterns in the upper 2 bits of the size field, with the last bit pattern being reserved for future algorithms (e.g. 00 = pglz, 01 = lz4, 10 = zstd, 11 = something else - see within for details).

- I don't really like the use of the phrase "custom compression". I think the terminology needs to be rethought so that we just talk about compression methods. Perhaps in certain contexts we need to specify that we mean extensible compression methods or user-provided compression methods or something like that, but I don't think the word "custom" is very well-suited here. The main point of this shouldn't be for every cluster in the universe to use a different approach to compression, or to compress columns within a database in 47 different ways, but rather to help us get out from under 'pglz'. Eventually we probably want to change the default, but as the patch phrases things now, that default would be a custom method, which is almost a contradiction in terms.

- Yet another possible approach to the on-disk format is to leave varatt_external.va_extsize and varattrib_4b.rawsize untouched and instead add new compression methods by adding new vartag_external values. There's quite a lot of bit-space available there: we have a whole byte, and we're currently only using 4 values. We could easily add a half-dozen new possibilities there for new compression methods without sweating the bit-space consumption. The main thing I don't like about this is that it only seems like a useful way to provide for out-of-line compression. Perhaps it could be generalized to allow for inline compression as well, but it seems like it would take some hacking.
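For reference, the existing enum in src/include/postgres.h is tiny; the last two members below are invented, purely to illustrate how compression methods could claim tag values instead of bits in the size word:

    typedef enum vartag_external
    {
        VARTAG_INDIRECT = 1,
        VARTAG_EXPANDED_RO = 2,
        VARTAG_EXPANDED_RW = 3,
        VARTAG_ONDISK = 18,

        /* hypothetical additions, for illustration only */
        VARTAG_ONDISK_LZ4 = 19,
        VARTAG_ONDISK_ZSTD = 20
    } vartag_external;

Each such tag would still describe an out-of-line datum, which is exactly the limitation noted above.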
- One thing I really don't like about the patch is that it consumes a bit from infomask2 for a new flag HEAP_HASCUSTOMCOMPRESSED. infomask bits are at a premium, and there's been no real progress in the decade-plus that I've been hanging around here in clawing back any bit-space. I think we really need to avoid burning our remaining bits for anything other than a really critical need, and I don't think I understand what the need is in this case. I might be missing something, but I'd really strongly suggest looking for a way to get rid of this. It also invents the concept of a TupleDesc flag, and the flag invented is TD_ATTR_CUSTOM_COMPRESSED; I'm not sure I see why we need that, either.

- It seems like this kind of approach has a sort of built-in circularity problem. It means that every place that might need to detoast a datum needs to be able to access the pg_am catalog. I wonder if that's actually true. For instance, consider logical decoding. I guess that can do catalog lookups in general, but can it do them from the places where detoasting is happening? Moreover, can it do them with the right snapshot? Suppose we rewrite a table to change the compression method, then drop the old compression method, then try to decode a transaction that modified that table before those operations were performed. As an even more extreme example, suppose we need to open pg_am, and to do that we have to build a relcache entry for it, and suppose the relevant pg_class entry had a relacl or reloptions field that happened to be custom-compressed. Or equally suppose that any of the various other tables we use when building a relcache entry had the same kind of problem, especially those that have TOAST tables. We could just disallow the use of non-default compressors in the system catalogs, but the benefits mentioned in http://postgr.es/m/5541614A.5030208@2ndquadrant.com seem too large to ignore.
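To make the shape of that problem concrete, here is roughly the lookup a detoast path would need if the compressed datum carries a pg_am OID. This is an invented sketch, not code from the patch; the function names decompress_custom_datum and invoke_decompressor are made up, and only the syscache calls and catalog structs are real APIs:

    /* Assumes #include "postgres.h", "access/htup_details.h", */
    /* "catalog/pg_am.h", and "utils/syscache.h".               */

    /* hypothetical helper, declared only so the sketch is complete */
    extern struct varlena *invoke_decompressor(regproc amhandler,
                                               struct varlena *attr);

    /* hypothetical detoast-side lookup */
    static struct varlena *
    decompress_custom_datum(struct varlena *attr, Oid cmoid)
    {
        HeapTuple   tuple;
        regproc     amhandler;

        /*
         * The crux of the circularity worry: we may already be detoasting a
         * catalog column (relacl, reloptions, ...) or running inside logical
         * decoding, and now we need a pg_am lookup, under a suitable
         * snapshot, just to find the decompressor.
         */
        tuple = SearchSysCache1(AMOID, ObjectIdGetDatum(cmoid));
        if (!HeapTupleIsValid(tuple))
            elog(ERROR, "cache lookup failed for access method %u", cmoid);
        amhandler = ((Form_pg_am) GETSTRUCT(tuple))->amhandler;
        ReleaseSysCache(tuple);

        /* Call the handler and then its decompress callback (details elided). */
        return invoke_decompressor(amhandler, attr);
    }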
- I think it would be awfully appealing if we could find some way of dividing this great big patch into some somewhat smaller patches. For example:

Patch #1. Add syntax allowing a compression method to be specified, but the only possible choice is pglz, and the PRESERVE stuff isn't supported, and changing the value associated with an existing column isn't supported, but we can add tab-completion support and stuff.

Patch #2. Add a second built-in method, like gzip or lz4.

Patch #3. Add support for changing the compression method associated with a column, forcing a table rewrite.

Patch #4. Add support for PRESERVE, so that you can change the compression method associated with a column without forcing a table rewrite, by including the old method in the PRESERVE list, or with a rewrite, by not including it in the PRESERVE list.

Patch #5. Add support for compression methods via the AM interface. Perhaps methods added in this manner are prohibited in system catalogs. (This could also go before #4 or even before #3, but with a noticeable hit to usability.)

Patch #6 (new development). Add a contrib module using the facility added in #5, perhaps with a slightly off-beat compressor like bzip2 that is more of a niche use case.

I think that if the patch set were broken up this way, it would be a lot easier to review and get committed. I think you could commit each bit separately. I don't think you'd want to commit #1 unless you had a sense that #2 was pretty close to done, and similarly for #5 and #6, but that would still make things a lot easier than having one giant monolithic patch, at least IMHO.

There might be more to say here, but that's what I have got for now. I hope it helps.

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company