Re: [HACKERS] Custom compression methods - Mailing list pgsql-hackers

From Dilip Kumar
Subject Re: [HACKERS] Custom compression methods
Date
Msg-id CAFiTN-u4S0_ZmtPdt4dvuw7U3x_eiYn+iwwTc17hivp-cCyKog@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] Custom compression methods  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: [HACKERS] Custom compression methods
List pgsql-hackers
On Tue, Oct 6, 2020 at 10:21 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Tue, Oct 06, 2020 at 11:00:55AM +0530, Dilip Kumar wrote:
> >On Mon, Oct 5, 2020 at 9:34 PM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >>
> >> On Mon, Oct 05, 2020 at 07:57:41PM +0530, Dilip Kumar wrote:
> >> >On Mon, Oct 5, 2020 at 5:53 PM Tomas Vondra
> >> ><tomas.vondra@2ndquadrant.com> wrote:
> >> >>
> >> >> On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote:
> >> >> >On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
> >> >> ><tomas.vondra@2ndquadrant.com> wrote:
> >> >> >
> >> >> >Thanks, Tomas for your feedback.
> >> >> >
> >> >> >> 9) attcompression ...
> >> >> >>
> >> >> >> The main issue I see is what the patch does with attcompression. Instead
> >> >> >> of just using it to store a the compression method, it's also used to
> >> >> >> store the preserved compression methods. And using NameData to store
> >> >> >> this seems wrong too - if we really want to store this info, the correct
> >> >> >> way is either using text[] or inventing charvector or similar.
> >> >> >
> >> >> >The reason for using the NameData is the get it in the fixed part of
> >> >> >the data structure.
> >> >> >
> >> >>
> >> >> Why do we need that? It's possible to have varlena fields with direct
> >> >> access (see pg_index.indkey for example).
> >> >
> >> >I see.  While making it NameData I was thinking whether we have an
> >> >option to direct access the varlena. Thanks for pointing me there. I
> >> >will change this.
> >> >
> >> > Adding NameData just to make
> >> >> it fixed-length means we're always adding 64B even if we just need a
> >> >> single byte, which means ~30% overhead for the FormData_pg_attribute.
> >> >> That seems a bit unnecessary, and might be an issue with many attributes
> >> >> (e.g. with many temp tables, etc.).
> >> >
> >> >You are right.  Even I did not like to keep 64B for this, so I will change it.
> >> >
> >> >>
> >> >> >> But to me this seems very much like a misuse of attcompression to track
> >> >> >> dependencies on compression methods, necessary because we don't have a
> >> >> >> separate catalog listing compression methods. If we had that, I think we
> >> >> >> could simply add dependencies between attributes and that catalog.
> >> >> >
> >> >> >Basically, up to this patch, we are having only built-in compression
> >> >> >methods and those can not be dropped so we don't need any dependency
> >> >> >at all.  We just want to know what is the current compression method
> >> >> >and what is the preserve compression methods supported for this
> >> >> >attribute.  Maybe we can do it better instead of using the NameData
> >> >> >but I don't think it makes sense to add a separate catalog?
> >> >> >
> >> >>
> >> >> Sure, I understand what the goal was - all I'm saying is that it looks
> >> >> very much like a workaround needed because we don't have the catalog.
> >> >>
> >> >> I don't quite understand how could we support custom compression methods
> >> >> without listing them in some sort of catalog?
> >> >
> >> >Yeah for supporting custom compression we need some catalog.
> >> >
> >> >> >> Moreover, having the catalog would allow adding compression methods
> >> >> >> (from extensions etc) instead of just having a list of hard-coded
> >> >> >> compression methods. Which seems like a strange limitation, considering
> >> >> >> this thread is called "custom compression methods".
> >> >> >
> >> >> >I think I forgot to mention while submitting the previous patch that
> >> >> >the next patch I am planning to submit is, Support creating the custom
> >> >> >compression methods wherein we can use pg_am catalog to insert the new
> >> >> >compression method.  And for dependency handling, we can create an
> >> >> >attribute dependency on the pg_am row.  Basically, we will create the
> >> >> >attribute dependency on the current compression method AM as well as
> >> >> >on the preserved compression methods AM.   As part of this, we will
> >> >> >add two build-in AMs for zlib and pglz, and the attcompression field
> >> >> >will be converted to the oid_vector (first OID will be of the current
> >> >> >compression method, followed by the preserved compression method's
> >> >> >oids).
> >> >> >
> >> >>
> >> >> Hmmm, ok. Not sure pg_am is the right place - compression methods don't
> >> >> quite match what I though AMs are about, but maybe it's just my fault.
> >> >>
> >> >> FWIW it seems a bit strange to first do the attcompression magic and
> >> >> then add the catalog later - I think we should start with the catalog
> >> >> right away. The advantage is that if we end up committing only some of
> >> >> the patches in this cycle, we already have all the infrastructure etc.
> >> >> We can reorder that later, though.
> >> >
> >> >Hmm,  yeah we can do this way as well that first create a new catalog
> >> >table and add entries for these two built-in methods and the
> >> >attcompression can store the oid vector.  But if we only commit the
> >> >build-in compression methods part then does it make sense to create an
> >> >extra catalog or adding these build-in methods to the existing catalog
> >> >(if we plan to use pg_am).  Then in attcompression instead of using
> >> >one byte for each preserve compression method, we need to use oid.  So
> >> >from Robert's mail[1], it appeared to me that he wants that the
> >> >build-in compression methods part should be independently committable
> >> >and if we think from that perspective then adding a catalog doesn't
> >> >make much sense.  But if we are planning to commit the custom method
> >> >also then it makes more sense to directly start with the catalog
> >> >because that way it will be easy to expand without much refactoring.
> >> >
> >> >[1]
https://www.postgresql.org/message-id/CA%2BTgmobSDVgUage9qQ5P_%3DF_9jaMkCgyKxUQGtFQU7oN4kX-AA%40mail.gmail.com
> >> >
> >>
> >> Hmmm. Maybe I'm missing something subtle, but I think that plan can be
> >> interpreted in various ways - it does not really say whether the initial
> >> list of built-in methods should be in some C array, or already in a proper
> >> catalog.
> >>
> >> All I'm saying is it seems a bit weird to first implement dependencies
> >> based on strange (mis)use of attcompression attribute, and then replace
> >> it with a proper catalog. My understanding is those patches are expected
> >> to be committable one by one, but the attcompression approach seems a
> >> bit too hacky to me - not sure I'd want to commit that ...
> >
> >Okay, I will change this.  So I will make create a new catalog
> >pg_compression and add the entry for two built-in compression methods
> >from the very first patch.
> >
>
> OK.
>
> >> >> >> 10) compression parameters?
> >> >> >>
> >> >> >> I wonder if we could/should allow parameters, like compression level
> >> >> >> (and maybe other stuff, depending on the compression method). PG13
> >> >> >> allowed that for opclasses, so perhaps we should allow it here too.
> >> >> >
> >> >> >Yes, that is also in the plan.   For doing this we are planning to add
> >> >> >an extra column in the pg_attribute which will store the compression
> >> >> >options for the current compression method. The original patch was
> >> >> >creating an extra catalog pg_column_compression, therein it maintains
> >> >> >the oid of the compression method as well as the compression options.
> >> >> >The advantage of creating an extra catalog is that we can keep the
> >> >> >compression options for the preserved compression methods also so that
> >> >> >we can support the options which can be used for decompressing the
> >> >> >data as well.  Whereas if we want to avoid this extra catalog then we
> >> >> >can not use that compression option for decompressing.  But most of
> >> >> >the options e.g. compression level are just for the compressing so it
> >> >> >is enough to store for the current compression method only.  What's
> >> >> >your thoughts?
> >> >> >
> >> >>
> >> >> Not sure. My assumption was we'd end up with a new catalog, but maybe
> >> >> stashing it into pg_attribute is fine. I was really thinking about two
> >> >> kinds of options - compression level, and some sort of column-level
> >> >> dictionary. Compression level is not necessary for decompression, but
> >> >> the dictionary ID would be needed. (I think the global dictionary was
> >> >> one of the use cases, aimed at JSON compression.)
> >> >
> >> >Ok
> >> >
> >> >> But I don't think stashing it in pg_attribute means we couldn't use it
> >> >> for decompression - we'd just need to keep an array of options, one for
> >> >> each compression method.
> >> >
> >> >Yeah, we can do that.
> >> >
> >> >Keeping it in a separate new catalog might be
> >> >> cleaner, and I'm not sure how large the configuration might be.
> >> >
> >> >Yeah in that case it will be better to store in a separate catalog,
> >> >because sometimes if multiple attributes are using the same
> >> >compression method with the same options then we can store the same
> >> >oid in attcompression instead of duplicating the option field.
> >> >
> >>
> >> I doubt deduplicating the options like this is (sharing options between
> >> columns) is really worth it, as it means extra complexity e.g. during
> >> ALTER TABLE ... SET COMPRESSION. I don't think we do that for other
> >> catalogs, so why should we do it here?
> >
> >Yeah, valid point.
> >
> >>
> >> Ultimately I think it's a question of how large we expect the options to
> >> be, and how flexible it needs to be.
> >>
> >> For example, what happens if the user does this:
> >>
> >>    ALTER ... SET COMPRESSION my_compression WITH (options1) PRESERVE;
> >>    ALTER ... SET COMPRESSION pglz PRESERVE;
> >>    ALTER ... SET COMPRESSION my_compression WITH (options2) PRESERVE;
> >>
> >> I believe it's enough to keep just the last value, but maybe I'm wrong
> >> and we need to keep the whole history?
> >
> >Currently, the syntax is like ALTER ... SET COMPRESSION my_compression
> >WITH (options1) PRESERVE (old_compression1, old_compression2..).  But I
> >think if the user just gives PRESERVE without a list then we should
> >just preserve the latest one.
> >
>
> Hmmm. Not sure that's very convenient. I'd expect the most common use
> case for PRESERVE being "I want to change compression for new data,
> without rewrite". If PRESERVE by default preserves the latest one, that
> pretty much forces users to always list all methods. I suggest
> iterpreting it as "preserve everything" instead.
>
> Another option would be to require either a list of methods, or some
> keyword defining what to preserve. Like for example
>
>      ... PRESERVE (m1, m2, ...)
>      ... PRESERVE ALL
>      ... PRESERVE LAST
>
> Does that make sense?

Yeah, this makes sense to me.

>
> >> The use case I'm thinking about is the column-level JSON compression,
> >> where one of the options identifies the dictionary. OTOH I'm not sure
> >> this is the right way to track this info - we need to know which options
> >> were compressed with which options, i.e. it needs to be encoded in each
> >> value directly. It'd also require changes to the PRESERVE handling
> >> because it'd be necessary to identify which options to preserve ...
> >>
> >> So maybe this is either nonsense or something we don't want to support,
> >> and we should only allow one option for each compression method.
> >
> >Yeah, it is a bit confusing to add the same compression method with
> >different compression options,  then in the preserve list, we will
> >have to allow the option as well along with the compression method to
> >know which compression method with what options we want to preserve.
> >
> >And also as you mentioned that in rows we need to know the option as
> >well.  I think for solving this anyways for the custom compression
> >methods we will have to store the OID of the compression method in the
> >toast header so we can provide an intermediate catalog which will
> >create a new row for each combination of compression method + option
> >and the toast header can store the OID of that row so that we know
> >with which compression method + option it was compressed with.
> >
>
> I agree. After thinking about this a bit more, I think we should just
> keep the last options for each compression method. If we need to allow
> multiple options for some future compression method, we can improve
> this, but until then it'd be an over-engineering. Let's do the simplest
> possible thing here.

Okay.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: nbtpage.c:356 Expression 'metad->btm_root != P_NONE' is always false.
Next
From: Thomas Munro
Date:
Subject: Re: Two fsync related performance issues?