Re: [HACKERS] Custom compression methods - Mailing list pgsql-hackers

From Dilip Kumar
Subject Re: [HACKERS] Custom compression methods
Date
Msg-id CAFiTN-sBH65e4B5kQB5MT8BQOPajkrKvSdKA-07+ra3f2x4vCA@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] Custom compression methods  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [HACKERS] Custom compression methods  (Dilip Kumar <dilipbalaut@gmail.com>)
List pgsql-hackers
On Thu, Mar 18, 2021 at 1:32 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> ).On Mon, Mar 15, 2021 at 6:58 PM Andres Freund <andres@anarazel.de> wrote:
> > - Adding all these indirect function calls via toast_compression[] just
> >   for all of two builtin methods isn't fun either.
>
> Yeah, it feels like this has too many layers of indirection now. Like,
> toast_decompress_datum() first gets TOAST_COMPRESS_METHOD(attr). Then
> it calls CompressionIdToMethod to convert one constant (like
> TOAST_PGLZ_COMPRESSION_ID) to another constant with a slightly
> different name (like TOAST_PGLZ_COMPRESSION). Then it calls
> GetCompressionRoutines() to get hold of the function pointers. Then it
> does an indirect functional call. That seemed like a pretty reasonable
> idea when we were trying to support arbitrary compression AMs without
> overly privileging the stuff that was built into core, but if we're
> just doing stuff that's built into core, then we could just switch
> (TOAST_COMPRESS_METHOD(attr)) and call the correct function. In fact,
> we could even move the stuff from toast_compression.c into detoast.c,
> which would allow the compiler to optimize better (e.g. by inlining,
> if it wants).
>
> The same applies to toast_decompress_datum_slice().

Changed this, but I have still kept the functions in
toast_compression.c.  I think keeping compression related
functionality in a separate file looks much cleaner.  Please have a
look and let me know that whether you still feel we should move it ti
detoast.c.  If the reason is that we can inline, then I feel we are
already paying cost of compression/decompression and compare to that
in lining a function will not make much difference.

> There's a similar issue in toast_get_compression_method() and the only
> caller, pg_column_compression(). Here the multiple mapping layers and
> the indirect function call are split across those two functions rather
> than all in the same one, but here again one could presumably find a
> place to just switch on TOAST_COMPRESS_METHOD(attr) or
> VARATT_EXTERNAL_GET_COMPRESSION(attr) and return "pglz" or "lz4"
> directly.

I have simplified that, only one level of function call from
pg_column_compression,  I have kept a toast_get_compression_id
function because in later patch 0005, we will be using that for
getting the compression id from the compressed data.

> In toast_compress_datum(), I think we could have a switch that invokes
> the appropriate compressor based on cmethod and sets a variable to the
> value to be passed as the final argument of
> TOAST_COMPRESS_SET_SIZE_AND_METHOD().

Done

> Likewise, I suppose CompressionNameToMethod could at least be
> simplified to use constant strings rather than stuff like
> toast_compression[TOAST_PGLZ_COMPRESSION_ID].cmname.

Done


Other changes:
- As suggested by Andres, remove compression method comparision from
eualTupleDesc, because it is not required now.
- I found one problem in existing patch, the problem was in
detoast_attr_slice, if externally stored data is compressed then we
compute max possible compressed size to fetch based on the slice
length, for that we were using pglz_maximum_compressed_size, which is
not correct for lz4.  For lz4, I think we need to fetch the complete
compressed data.  We might think that for lz4 we might compute lie
Min(LZ4_compressBound(slicelength, total_compressed_size);  But IMHO,
we can not do that and the reason is same that why we should not use
PGLZ_MAX_OUTPUT for pglz (explained in the comment atop
pglz_maximum_compressed_size).


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

pgsql-hackers by date:

Previous
From: Thomas Munro
Date:
Subject: Re: fdatasync performance problem with large number of DB files
Next
From: Amit Kapila
Date:
Subject: Re: Parallel INSERT (INTO ... SELECT ...)