Re: [HACKERS] Custom compression methods - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: [HACKERS] Custom compression methods
Date
Msg-id 20201005122307.6sebqnn3qnvkvjoc@development
Whole thread Raw
In response to Re: [HACKERS] Custom compression methods  (Dilip Kumar <dilipbalaut@gmail.com>)
Responses Re: [HACKERS] Custom compression methods
List pgsql-hackers
On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote:
>On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>
>Thanks, Tomas for your feedback.
>
>> 9) attcompression ...
>>
>> The main issue I see is what the patch does with attcompression. Instead
>> of just using it to store a the compression method, it's also used to
>> store the preserved compression methods. And using NameData to store
>> this seems wrong too - if we really want to store this info, the correct
>> way is either using text[] or inventing charvector or similar.
>
>The reason for using the NameData is the get it in the fixed part of
>the data structure.
>

Why do we need that? It's possible to have varlena fields with direct
access (see pg_index.indkey for example). Adding NameData just to make
it fixed-length means we're always adding 64B even if we just need a
single byte, which means ~30% overhead for the FormData_pg_attribute.
That seems a bit unnecessary, and might be an issue with many attributes
(e.g. with many temp tables, etc.).

>> But to me this seems very much like a misuse of attcompression to track
>> dependencies on compression methods, necessary because we don't have a
>> separate catalog listing compression methods. If we had that, I think we
>> could simply add dependencies between attributes and that catalog.
>
>Basically, up to this patch, we are having only built-in compression
>methods and those can not be dropped so we don't need any dependency
>at all.  We just want to know what is the current compression method
>and what is the preserve compression methods supported for this
>attribute.  Maybe we can do it better instead of using the NameData
>but I don't think it makes sense to add a separate catalog?
>

Sure, I understand what the goal was - all I'm saying is that it looks
very much like a workaround needed because we don't have the catalog.

I don't quite understand how could we support custom compression methods
without listing them in some sort of catalog?

>> Moreover, having the catalog would allow adding compression methods
>> (from extensions etc) instead of just having a list of hard-coded
>> compression methods. Which seems like a strange limitation, considering
>> this thread is called "custom compression methods".
>
>I think I forgot to mention while submitting the previous patch that
>the next patch I am planning to submit is, Support creating the custom
>compression methods wherein we can use pg_am catalog to insert the new
>compression method.  And for dependency handling, we can create an
>attribute dependency on the pg_am row.  Basically, we will create the
>attribute dependency on the current compression method AM as well as
>on the preserved compression methods AM.   As part of this, we will
>add two build-in AMs for zlib and pglz, and the attcompression field
>will be converted to the oid_vector (first OID will be of the current
>compression method, followed by the preserved compression method's
>oids).
>

Hmmm, ok. Not sure pg_am is the right place - compression methods don't
quite match what I though AMs are about, but maybe it's just my fault.

FWIW it seems a bit strange to first do the attcompression magic and
then add the catalog later - I think we should start with the catalog
right away. The advantage is that if we end up committing only some of
the patches in this cycle, we already have all the infrastructure etc.
We can reorder that later, though.

>> 10) compression parameters?
>>
>> I wonder if we could/should allow parameters, like compression level
>> (and maybe other stuff, depending on the compression method). PG13
>> allowed that for opclasses, so perhaps we should allow it here too.
>
>Yes, that is also in the plan.   For doing this we are planning to add
>an extra column in the pg_attribute which will store the compression
>options for the current compression method. The original patch was
>creating an extra catalog pg_column_compression, therein it maintains
>the oid of the compression method as well as the compression options.
>The advantage of creating an extra catalog is that we can keep the
>compression options for the preserved compression methods also so that
>we can support the options which can be used for decompressing the
>data as well.  Whereas if we want to avoid this extra catalog then we
>can not use that compression option for decompressing.  But most of
>the options e.g. compression level are just for the compressing so it
>is enough to store for the current compression method only.  What's
>your thoughts?
>

Not sure. My assumption was we'd end up with a new catalog, but maybe
stashing it into pg_attribute is fine. I was really thinking about two
kinds of options - compression level, and some sort of column-level
dictionary. Compression level is not necessary for decompression, but
the dictionary ID would be needed. (I think the global dictionary was
one of the use cases, aimed at JSON compression.)

But I don't think stashing it in pg_attribute means we couldn't use it
for decompression - we'd just need to keep an array of options, one for
each compression method. Keeping it in a separate new catalog might be
cleaner, and I'm not sure how large the configuration might be.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Heikki Linnakangas
Date:
Subject: Re: Online checksums patch - once again
Next
From: Ashutosh Bapat
Date:
Subject: Re: Partitionwise join fails under GEQO