Re: [HACKERS] Custom compression methods - Mailing list pgsql-hackers
From | Dilip Kumar |
---|---|
Subject | Re: [HACKERS] Custom compression methods |
Date | |
Msg-id | CAFiTN-u4S0_ZmtPdt4dvuw7U3x_eiYn+iwwTc17hivp-cCyKog@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] Custom compression methods (Tomas Vondra <tomas.vondra@2ndquadrant.com>) |
Responses |
Re: [HACKERS] Custom compression methods
|
List | pgsql-hackers |
On Tue, Oct 6, 2020 at 10:21 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Tue, Oct 06, 2020 at 11:00:55AM +0530, Dilip Kumar wrote: > >On Mon, Oct 5, 2020 at 9:34 PM Tomas Vondra > ><tomas.vondra@2ndquadrant.com> wrote: > >> > >> On Mon, Oct 05, 2020 at 07:57:41PM +0530, Dilip Kumar wrote: > >> >On Mon, Oct 5, 2020 at 5:53 PM Tomas Vondra > >> ><tomas.vondra@2ndquadrant.com> wrote: > >> >> > >> >> On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote: > >> >> >On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra > >> >> ><tomas.vondra@2ndquadrant.com> wrote: > >> >> > > >> >> >Thanks, Tomas for your feedback. > >> >> > > >> >> >> 9) attcompression ... > >> >> >> > >> >> >> The main issue I see is what the patch does with attcompression. Instead > >> >> >> of just using it to store a the compression method, it's also used to > >> >> >> store the preserved compression methods. And using NameData to store > >> >> >> this seems wrong too - if we really want to store this info, the correct > >> >> >> way is either using text[] or inventing charvector or similar. > >> >> > > >> >> >The reason for using the NameData is the get it in the fixed part of > >> >> >the data structure. > >> >> > > >> >> > >> >> Why do we need that? It's possible to have varlena fields with direct > >> >> access (see pg_index.indkey for example). > >> > > >> >I see. While making it NameData I was thinking whether we have an > >> >option to direct access the varlena. Thanks for pointing me there. I > >> >will change this. > >> > > >> > Adding NameData just to make > >> >> it fixed-length means we're always adding 64B even if we just need a > >> >> single byte, which means ~30% overhead for the FormData_pg_attribute. > >> >> That seems a bit unnecessary, and might be an issue with many attributes > >> >> (e.g. with many temp tables, etc.). > >> > > >> >You are right. Even I did not like to keep 64B for this, so I will change it. > >> > > >> >> > >> >> >> But to me this seems very much like a misuse of attcompression to track > >> >> >> dependencies on compression methods, necessary because we don't have a > >> >> >> separate catalog listing compression methods. If we had that, I think we > >> >> >> could simply add dependencies between attributes and that catalog. > >> >> > > >> >> >Basically, up to this patch, we are having only built-in compression > >> >> >methods and those can not be dropped so we don't need any dependency > >> >> >at all. We just want to know what is the current compression method > >> >> >and what is the preserve compression methods supported for this > >> >> >attribute. Maybe we can do it better instead of using the NameData > >> >> >but I don't think it makes sense to add a separate catalog? > >> >> > > >> >> > >> >> Sure, I understand what the goal was - all I'm saying is that it looks > >> >> very much like a workaround needed because we don't have the catalog. > >> >> > >> >> I don't quite understand how could we support custom compression methods > >> >> without listing them in some sort of catalog? > >> > > >> >Yeah for supporting custom compression we need some catalog. > >> > > >> >> >> Moreover, having the catalog would allow adding compression methods > >> >> >> (from extensions etc) instead of just having a list of hard-coded > >> >> >> compression methods. Which seems like a strange limitation, considering > >> >> >> this thread is called "custom compression methods". > >> >> > > >> >> >I think I forgot to mention while submitting the previous patch that > >> >> >the next patch I am planning to submit is, Support creating the custom > >> >> >compression methods wherein we can use pg_am catalog to insert the new > >> >> >compression method. And for dependency handling, we can create an > >> >> >attribute dependency on the pg_am row. Basically, we will create the > >> >> >attribute dependency on the current compression method AM as well as > >> >> >on the preserved compression methods AM. As part of this, we will > >> >> >add two build-in AMs for zlib and pglz, and the attcompression field > >> >> >will be converted to the oid_vector (first OID will be of the current > >> >> >compression method, followed by the preserved compression method's > >> >> >oids). > >> >> > > >> >> > >> >> Hmmm, ok. Not sure pg_am is the right place - compression methods don't > >> >> quite match what I though AMs are about, but maybe it's just my fault. > >> >> > >> >> FWIW it seems a bit strange to first do the attcompression magic and > >> >> then add the catalog later - I think we should start with the catalog > >> >> right away. The advantage is that if we end up committing only some of > >> >> the patches in this cycle, we already have all the infrastructure etc. > >> >> We can reorder that later, though. > >> > > >> >Hmm, yeah we can do this way as well that first create a new catalog > >> >table and add entries for these two built-in methods and the > >> >attcompression can store the oid vector. But if we only commit the > >> >build-in compression methods part then does it make sense to create an > >> >extra catalog or adding these build-in methods to the existing catalog > >> >(if we plan to use pg_am). Then in attcompression instead of using > >> >one byte for each preserve compression method, we need to use oid. So > >> >from Robert's mail[1], it appeared to me that he wants that the > >> >build-in compression methods part should be independently committable > >> >and if we think from that perspective then adding a catalog doesn't > >> >make much sense. But if we are planning to commit the custom method > >> >also then it makes more sense to directly start with the catalog > >> >because that way it will be easy to expand without much refactoring. > >> > > >> >[1] https://www.postgresql.org/message-id/CA%2BTgmobSDVgUage9qQ5P_%3DF_9jaMkCgyKxUQGtFQU7oN4kX-AA%40mail.gmail.com > >> > > >> > >> Hmmm. Maybe I'm missing something subtle, but I think that plan can be > >> interpreted in various ways - it does not really say whether the initial > >> list of built-in methods should be in some C array, or already in a proper > >> catalog. > >> > >> All I'm saying is it seems a bit weird to first implement dependencies > >> based on strange (mis)use of attcompression attribute, and then replace > >> it with a proper catalog. My understanding is those patches are expected > >> to be committable one by one, but the attcompression approach seems a > >> bit too hacky to me - not sure I'd want to commit that ... > > > >Okay, I will change this. So I will make create a new catalog > >pg_compression and add the entry for two built-in compression methods > >from the very first patch. > > > > OK. > > >> >> >> 10) compression parameters? > >> >> >> > >> >> >> I wonder if we could/should allow parameters, like compression level > >> >> >> (and maybe other stuff, depending on the compression method). PG13 > >> >> >> allowed that for opclasses, so perhaps we should allow it here too. > >> >> > > >> >> >Yes, that is also in the plan. For doing this we are planning to add > >> >> >an extra column in the pg_attribute which will store the compression > >> >> >options for the current compression method. The original patch was > >> >> >creating an extra catalog pg_column_compression, therein it maintains > >> >> >the oid of the compression method as well as the compression options. > >> >> >The advantage of creating an extra catalog is that we can keep the > >> >> >compression options for the preserved compression methods also so that > >> >> >we can support the options which can be used for decompressing the > >> >> >data as well. Whereas if we want to avoid this extra catalog then we > >> >> >can not use that compression option for decompressing. But most of > >> >> >the options e.g. compression level are just for the compressing so it > >> >> >is enough to store for the current compression method only. What's > >> >> >your thoughts? > >> >> > > >> >> > >> >> Not sure. My assumption was we'd end up with a new catalog, but maybe > >> >> stashing it into pg_attribute is fine. I was really thinking about two > >> >> kinds of options - compression level, and some sort of column-level > >> >> dictionary. Compression level is not necessary for decompression, but > >> >> the dictionary ID would be needed. (I think the global dictionary was > >> >> one of the use cases, aimed at JSON compression.) > >> > > >> >Ok > >> > > >> >> But I don't think stashing it in pg_attribute means we couldn't use it > >> >> for decompression - we'd just need to keep an array of options, one for > >> >> each compression method. > >> > > >> >Yeah, we can do that. > >> > > >> >Keeping it in a separate new catalog might be > >> >> cleaner, and I'm not sure how large the configuration might be. > >> > > >> >Yeah in that case it will be better to store in a separate catalog, > >> >because sometimes if multiple attributes are using the same > >> >compression method with the same options then we can store the same > >> >oid in attcompression instead of duplicating the option field. > >> > > >> > >> I doubt deduplicating the options like this is (sharing options between > >> columns) is really worth it, as it means extra complexity e.g. during > >> ALTER TABLE ... SET COMPRESSION. I don't think we do that for other > >> catalogs, so why should we do it here? > > > >Yeah, valid point. > > > >> > >> Ultimately I think it's a question of how large we expect the options to > >> be, and how flexible it needs to be. > >> > >> For example, what happens if the user does this: > >> > >> ALTER ... SET COMPRESSION my_compression WITH (options1) PRESERVE; > >> ALTER ... SET COMPRESSION pglz PRESERVE; > >> ALTER ... SET COMPRESSION my_compression WITH (options2) PRESERVE; > >> > >> I believe it's enough to keep just the last value, but maybe I'm wrong > >> and we need to keep the whole history? > > > >Currently, the syntax is like ALTER ... SET COMPRESSION my_compression > >WITH (options1) PRESERVE (old_compression1, old_compression2..). But I > >think if the user just gives PRESERVE without a list then we should > >just preserve the latest one. > > > > Hmmm. Not sure that's very convenient. I'd expect the most common use > case for PRESERVE being "I want to change compression for new data, > without rewrite". If PRESERVE by default preserves the latest one, that > pretty much forces users to always list all methods. I suggest > iterpreting it as "preserve everything" instead. > > Another option would be to require either a list of methods, or some > keyword defining what to preserve. Like for example > > ... PRESERVE (m1, m2, ...) > ... PRESERVE ALL > ... PRESERVE LAST > > Does that make sense? Yeah, this makes sense to me. > > >> The use case I'm thinking about is the column-level JSON compression, > >> where one of the options identifies the dictionary. OTOH I'm not sure > >> this is the right way to track this info - we need to know which options > >> were compressed with which options, i.e. it needs to be encoded in each > >> value directly. It'd also require changes to the PRESERVE handling > >> because it'd be necessary to identify which options to preserve ... > >> > >> So maybe this is either nonsense or something we don't want to support, > >> and we should only allow one option for each compression method. > > > >Yeah, it is a bit confusing to add the same compression method with > >different compression options, then in the preserve list, we will > >have to allow the option as well along with the compression method to > >know which compression method with what options we want to preserve. > > > >And also as you mentioned that in rows we need to know the option as > >well. I think for solving this anyways for the custom compression > >methods we will have to store the OID of the compression method in the > >toast header so we can provide an intermediate catalog which will > >create a new row for each combination of compression method + option > >and the toast header can store the OID of that row so that we know > >with which compression method + option it was compressed with. > > > > I agree. After thinking about this a bit more, I think we should just > keep the last options for each compression method. If we need to allow > multiple options for some future compression method, we can improve > this, but until then it'd be an over-engineering. Let's do the simplest > possible thing here. Okay. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
pgsql-hackers by date: