Home > mailing lists

Re: [HACKERS] Custom compression methods - Mailing list pgsql-hackers

From	Tomas Vondra
Subject	Re: [HACKERS] Custom compression methods
Date	November 20, 2017 05:04:53
Msg-id	62e46a47-08a8-6e06-5e4f-6f52e0a13202@2ndquadrant.com Whole thread Raw
In response to	Re: [HACKERS] Custom compression methods (Ildus Kurbangaliev <i.kurbangaliev@postgrespro.ru>)
Responses	Re: [HACKERS] Custom compression methods
List	pgsql-hackers

Tree view

Hi,

On 11/14/2017 02:23 PM, Ildus Kurbangaliev wrote:
>
> ...
>
> Attached version 4 of the patch. Fixed pg_upgrade and few other bugs.
> 

I did a review of this today, and I think there are some things that
need improvement / fixing.

Firstly, some basic comments from just eye-balling the diff, then some
bugs I discovered after writing an extension adding lz4.

1) formatRelOptions/freeRelOptions are no longer needed (I see Ildar
already pointer that out)

2) There's unnecessary whitespace (extra newlines) on a couple of
places, which is needlessly increasing the size of the patch. Small
difference, but annoying.

3) tuptoaster.c

Why do you change 'info' from int32 to uint32? Seems unnecessary.

Adding new 'att' variable in toast_insert_or_update is confusing, as
there already is 'att' in the very next loop. Technically it's correct,
but I'd bet it'll lead to some WTF?! moments later. I propose to just
use TupleDescAttr(tupleDesc,i) on the two places where it matters,
around line 808.

There are no comments for init_compression_options_htab and
get_compression_options_info, so that needs to be fixed. Moreover, the
names are confusing because what we really get is not just 'options' but
the compression routines too.

4) gen_db_file_maps probably shouldn't do the fprints, right?

5) not sure why you modify src/tools/pgindent/exclude_file_patterns

6) I'm rather confused by AttributeCompression vs. ColumnCompression. I
mean, attribute==column, right? Of course, one is for data from parser,
the other one is for internal info. But can we make the naming clearer?

7) The docs in general are somewhat unsatisfactory, TBH. For example the
ColumnCompression has no comments, unlike everything else in parsenodes.
Similarly for the SGML docs - I suggest to expand them to resemble FDW
docs (https://www.postgresql.org/docs/10/static/fdwhandler.html) which
also follows the handler/routines pattern.

8) One of the unclear things if why we even need 'drop' routing. It
seems that if it's defined DropAttributeCompression does something. But
what should it do? I suppose dropping the options should be done using
dependencies (just like we drop columns in this case).

BTW why does DropAttributeCompression mess with att->attisdropped in
this way? That seems a bit odd.

9) configure routines that only check if (options != NIL) and then error
out (like tsvector_configure) seem a bit unnecessary. Just allow it to
be NULL in CompressionMethodRoutine, and throw an error if options is
not NIL for such compression method.

10) toast_compress_datum still does this:

    if (!ac && (valsize < PGLZ_strategy_default->min_input_size ||
                valsize > PGLZ_strategy_default->max_input_size))

which seems rather pglz-specific (the naming is a hint). Why shouldn't
this be specific to compression, exposed either as min/max constants, or
wrapped in another routine - size_is_valid() or something like that?

11) The comments in toast_compress_datum probably need updating, as it
still references to pglz specifically. I guess the new compression
methods do matter too.

12) get_compression_options_info organizes the compression info into a
hash table by OID. The hash table implementation assumes the hash key is
at the beginning of the entry, but AttributeCompression is defined like
this:

    typedef struct
    {
        CompressionMethodRoutine *routine;
        List  *options;
        Oid cmoptoid;
    } AttributeCompression;

Which means get_compression_options_info is busted, will never lookup
anything, and the hash table will grow by adding more and more entries
into the same bucket. Of course, this has extremely negative impact on
performance (pretty much arbitrarily bad, depending on how many entries
you've already added to the hash table).

Moving the OID to the beginning of the struct fixes the issue.

13) When writing the experimental extension, I was extremely confused
about the regular varlena headers, custom compression headers, etc. In
the end I stole the code from tsvector.c and whacked it a bit until it
worked, but I wouldn't dare to claim I understand how it works.

This needs to be documented somewhere. For example postgres.h has a
bunch of paragraphs about varlena headers, so perhaps it should be
there? I see the patch tweaks some of the constants, but does not update
the comment at all.

Perhaps it would be useful to provide some additional macros making
access to custom-compressed varlena values easier. Or perhaps the
VARSIZE_ANY / VARSIZE_ANY_EXHDR / VARDATA_ANY already support that? This
part is not very clear to me.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

pg_lz4.tgz

pgsql-hackers by date:

From: Andrew Dunstan
Date: 20 November 2017, 04:56:29
Subject: Re: [HACKERS] [PATCH] A hook for session start

From: Tomas Vondra
Date: 20 November 2017, 05:23:23
Subject: Re: [HACKERS] Custom compression methods

Re: [HACKERS] Custom compression methods - Mailing list pgsql-hackers

Attachment

Previous

Next