Re: [HACKERS] Custom compression methods - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: [HACKERS] Custom compression methods
Date
Msg-id 20201008215408.bvrdmt7z2qug7lan@development
Whole thread Raw
In response to Re: [HACKERS] Custom compression methods  (Dilip Kumar <dilipbalaut@gmail.com>)
Responses Re: [HACKERS] Custom compression methods
Re: [HACKERS] Custom compression methods
List pgsql-hackers
On Thu, Oct 08, 2020 at 02:38:27PM +0530, Dilip Kumar wrote:
>On Wed, Oct 7, 2020 at 5:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>> On Wed, Oct 7, 2020 at 10:26 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> >
>> > On Tue, Oct 6, 2020 at 10:21 PM Tomas Vondra
>> > <tomas.vondra@2ndquadrant.com> wrote:
>> > >
>> > > On Tue, Oct 06, 2020 at 11:00:55AM +0530, Dilip Kumar wrote:
>> > > >On Mon, Oct 5, 2020 at 9:34 PM Tomas Vondra
>> > > ><tomas.vondra@2ndquadrant.com> wrote:
>> > > >>
>> > > >> On Mon, Oct 05, 2020 at 07:57:41PM +0530, Dilip Kumar wrote:
>> > > >> >On Mon, Oct 5, 2020 at 5:53 PM Tomas Vondra
>> > > >> ><tomas.vondra@2ndquadrant.com> wrote:
>> > > >> >>
>> > > >> >> On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote:
>> > > >> >> >On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
>> > > >> >> ><tomas.vondra@2ndquadrant.com> wrote:
>> > > >> >> >
>> > > >> >> >Thanks, Tomas for your feedback.
>> > > >> >> >
>> > > >> >> >> 9) attcompression ...
>> > > >> >> >>
>> > > >> >> >> The main issue I see is what the patch does with attcompression. Instead
>> > > >> >> >> of just using it to store a the compression method, it's also used to
>> > > >> >> >> store the preserved compression methods. And using NameData to store
>> > > >> >> >> this seems wrong too - if we really want to store this info, the correct
>> > > >> >> >> way is either using text[] or inventing charvector or similar.
>> > > >> >> >
>> > > >> >> >The reason for using the NameData is the get it in the fixed part of
>> > > >> >> >the data structure.
>> > > >> >> >
>> > > >> >>
>> > > >> >> Why do we need that? It's possible to have varlena fields with direct
>> > > >> >> access (see pg_index.indkey for example).
>> > > >> >
>> > > >> >I see.  While making it NameData I was thinking whether we have an
>> > > >> >option to direct access the varlena. Thanks for pointing me there. I
>> > > >> >will change this.
>> > > >> >
>> > > >> > Adding NameData just to make
>> > > >> >> it fixed-length means we're always adding 64B even if we just need a
>> > > >> >> single byte, which means ~30% overhead for the FormData_pg_attribute.
>> > > >> >> That seems a bit unnecessary, and might be an issue with many attributes
>> > > >> >> (e.g. with many temp tables, etc.).
>> > > >> >
>> > > >> >You are right.  Even I did not like to keep 64B for this, so I will change it.
>> > > >> >
>> > > >> >>
>> > > >> >> >> But to me this seems very much like a misuse of attcompression to track
>> > > >> >> >> dependencies on compression methods, necessary because we don't have a
>> > > >> >> >> separate catalog listing compression methods. If we had that, I think we
>> > > >> >> >> could simply add dependencies between attributes and that catalog.
>> > > >> >> >
>> > > >> >> >Basically, up to this patch, we are having only built-in compression
>> > > >> >> >methods and those can not be dropped so we don't need any dependency
>> > > >> >> >at all.  We just want to know what is the current compression method
>> > > >> >> >and what is the preserve compression methods supported for this
>> > > >> >> >attribute.  Maybe we can do it better instead of using the NameData
>> > > >> >> >but I don't think it makes sense to add a separate catalog?
>> > > >> >> >
>> > > >> >>
>> > > >> >> Sure, I understand what the goal was - all I'm saying is that it looks
>> > > >> >> very much like a workaround needed because we don't have the catalog.
>> > > >> >>
>> > > >> >> I don't quite understand how could we support custom compression methods
>> > > >> >> without listing them in some sort of catalog?
>> > > >> >
>> > > >> >Yeah for supporting custom compression we need some catalog.
>> > > >> >
>> > > >> >> >> Moreover, having the catalog would allow adding compression methods
>> > > >> >> >> (from extensions etc) instead of just having a list of hard-coded
>> > > >> >> >> compression methods. Which seems like a strange limitation, considering
>> > > >> >> >> this thread is called "custom compression methods".
>> > > >> >> >
>> > > >> >> >I think I forgot to mention while submitting the previous patch that
>> > > >> >> >the next patch I am planning to submit is, Support creating the custom
>> > > >> >> >compression methods wherein we can use pg_am catalog to insert the new
>> > > >> >> >compression method.  And for dependency handling, we can create an
>> > > >> >> >attribute dependency on the pg_am row.  Basically, we will create the
>> > > >> >> >attribute dependency on the current compression method AM as well as
>> > > >> >> >on the preserved compression methods AM.   As part of this, we will
>> > > >> >> >add two build-in AMs for zlib and pglz, and the attcompression field
>> > > >> >> >will be converted to the oid_vector (first OID will be of the current
>> > > >> >> >compression method, followed by the preserved compression method's
>> > > >> >> >oids).
>> > > >> >> >
>> > > >> >>
>> > > >> >> Hmmm, ok. Not sure pg_am is the right place - compression methods don't
>> > > >> >> quite match what I though AMs are about, but maybe it's just my fault.
>> > > >> >>
>> > > >> >> FWIW it seems a bit strange to first do the attcompression magic and
>> > > >> >> then add the catalog later - I think we should start with the catalog
>> > > >> >> right away. The advantage is that if we end up committing only some of
>> > > >> >> the patches in this cycle, we already have all the infrastructure etc.
>> > > >> >> We can reorder that later, though.
>> > > >> >
>> > > >> >Hmm,  yeah we can do this way as well that first create a new catalog
>> > > >> >table and add entries for these two built-in methods and the
>> > > >> >attcompression can store the oid vector.  But if we only commit the
>> > > >> >build-in compression methods part then does it make sense to create an
>> > > >> >extra catalog or adding these build-in methods to the existing catalog
>> > > >> >(if we plan to use pg_am).  Then in attcompression instead of using
>> > > >> >one byte for each preserve compression method, we need to use oid.  So
>> > > >> >from Robert's mail[1], it appeared to me that he wants that the
>> > > >> >build-in compression methods part should be independently committable
>> > > >> >and if we think from that perspective then adding a catalog doesn't
>> > > >> >make much sense.  But if we are planning to commit the custom method
>> > > >> >also then it makes more sense to directly start with the catalog
>> > > >> >because that way it will be easy to expand without much refactoring.
>> > > >> >
>> > > >> >[1]
https://www.postgresql.org/message-id/CA%2BTgmobSDVgUage9qQ5P_%3DF_9jaMkCgyKxUQGtFQU7oN4kX-AA%40mail.gmail.com
>> > > >> >
>> > > >>
>> > > >> Hmmm. Maybe I'm missing something subtle, but I think that plan can be
>> > > >> interpreted in various ways - it does not really say whether the initial
>> > > >> list of built-in methods should be in some C array, or already in a proper
>> > > >> catalog.
>> > > >>
>> > > >> All I'm saying is it seems a bit weird to first implement dependencies
>> > > >> based on strange (mis)use of attcompression attribute, and then replace
>> > > >> it with a proper catalog. My understanding is those patches are expected
>> > > >> to be committable one by one, but the attcompression approach seems a
>> > > >> bit too hacky to me - not sure I'd want to commit that ...
>> > > >
>> > > >Okay, I will change this.  So I will make create a new catalog
>> > > >pg_compression and add the entry for two built-in compression methods
>> > > >from the very first patch.
>> > > >
>> > >
>> > > OK.
>>
>> I have changed the first 2 patches, basically, now we are providing a
>> new catalog pg_compression and the pg_attribute is storing the oid of
>> the compression method.  The patches still need some cleanup and there
>> is also one open comment that for index we should use its table
>> compression.
>>
>> I am still working on the preserve patch.  For preserving the
>> compression method I am planning to convert the attcompression field
>> to the oidvector so that we can store the oid of the preserve method
>> also.  I am not sure whether we can access this oidvector as a fixed
>> part of the FormData_pg_attribute or not.  The reason is that for
>> building the tuple descriptor, we need to give the size of the fixed
>> part (#define ATTRIBUTE_FIXED_PART_SIZE \
>> (offsetof(FormData_pg_attribute,attcompression) + sizeof(Oid))).  But
>> if we convert this to the oidvector then we don't know the size of the
>> fixed part.  Am I missing something?
>
>I could think of two solutions here
>Sol1.
>Make the first oid of the oidvector as part of the fixed size, like below
>#define ATTRIBUTE_FIXED_PART_SIZE \
>(offsetof(FormData_pg_attribute, attcompression) + OidVectorSize(1))
>
>Sol2:
>Keep attcompression as oid only and for the preserve list, adds
>another field in the variable part which will be of type oidvector.  I
>think most of the time we need to access the current compression
>method and with this solution, we will be able to access that as part
>of the tuple desc.
>

And is the oidvector actually needed? If we have the extra catalog,
can't we track this simply using the regular dependencies? So we'd have
the attcompression OID of the current compression method, and the
preserved values would be tracked in pg_depend.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Thomas Munro
Date:
Subject: Re: Parallel INSERT (INTO ... SELECT ...)
Next
From: Tomas Vondra
Date:
Subject: Re: POC: postgres_fdw insert batching