Re: [HACKERS] Custom compression methods - Mailing list pgsql-hackers

From Dilip Kumar
Subject Re: [HACKERS] Custom compression methods
Date
Msg-id CAFiTN-v8a7PMTWWwLG4ASNw9GZwhFh40TEwLz8oLjUnhbUGqTw@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] Custom compression methods  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [HACKERS] Custom compression methods  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Sat, Nov 21, 2020 at 3:50 AM Robert Haas <robertmhaas@gmail.com> wrote:

Most of the comments looks fine to me but I have a slightly different
opinion for one of the comment so replying only for that.

> I'm worried about how expensive this might be, and I think we could
> make it cheaper. The reason why I think this might be expensive is:
> currently, for every datum, you have a single direct function call.
> Now, with this, you first have a direct function call to
> GetCompressionOidFromCompressionId(). Then you have a call to
> GetCompressionRoutine(), which does a syscache lookup and calls a
> handler function, which is quite a lot more expensive than a single
> function call. And the handler isn't even returning a statically
> allocated structure, but is allocating new memory every time, which
> involves more function calls and maybe memory leaks. Then you use the
> results of all that to make an indirect function call.
>
> I'm not sure exactly what combination of things we could use to make
> this better, but it seems like there are a few possibilities:
>
> (1) The handler function could return a pointer to the same
> CompressionRoutine every time instead of constructing a new one every
> time.
> (2) The CompressionRoutine to which the handler function returns a
> pointer could be statically allocated instead of being built at
> runtime.
> (3) GetCompressionRoutine could have an OID -> handler cache instead
> of relying on syscache + calling the handler function all over again.
> (4) For the compression types that have dedicated bit patterns in the
> high bits of the compressed TOAST size, toast_compress_datum() could
> just have hard-coded logic to use the correct handlers instead of
> translating the bit pattern into an OID and then looking it up over
> again.
> (5) Going even further than #4 we could skip the handler layer
> entirely for such methods, and just call the right function directly.
>
> I think we should definitely do (1), and also (2) unless there's some
> reason it's hard. (3) doesn't need to be part of this patch, but might
> be something to consider later in the series. It's possible that it
> doesn't have enough benefit to be worth the work, though. Also, I
> think we should do either (4) or (5). I have a mild preference for (5)
> unless it looks too ugly.
>
> Note that I'm not talking about hard-coding a fast path for a
> hard-coded list of OIDs - which would seem a little bit unprincipled -
> but hard-coding a fast path for the bit patterns that are themselves
> hard-coded. I don't think we lose anything in terms of extensibility
> or even-handedness there; it's just avoiding a bunch of rigamarole
> that doesn't really buy us anything.
>
> All these points apply equally to toast_decompress_datum_slice() and
> toast_compress_datum().

I agree that (1) and (2) we shall definitely do as part of the first
patch and (3) we might do in later patches.  I think from (4) and (5)
I am more inclined to do (4) for a couple of reasons
a) If we bypass the handler function and directly calls the
compression and decompression routines then we need to check whether
the current executable is compiled with this particular compression
library or not for example in 'lz4handler' we have this below check,
now if we don't have the handler function we either need to put this
in each compression/decompression functions or we need to put is in
each caller.
Datum
lz4handler(PG_FUNCTION_ARGS)
{
#ifndef HAVE_LIBLZ4
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("not built with lz4 support")));
#else

b) Another reason is that once we start supporting the compression
options (0006-Support-compression-methods-options.patch) then we also
need to call 'cminitstate_function' for parsing the compression
options and then calling the compression function, so we need to
hardcode multiple function calls.

I think b) is still okay but because of a) I am more inclined to do
(4), what is your opinion on this?

About (4), one option is that we directly call the correct handler
function for the built-in type directly from
toast_(de)compress(_slice) functions but in that case, we are
duplicating the code, another option is that we call the
GetCompressionRoutine() a common function and in that, for the
built-in type, we can directly call the corresponding handler function
and get the routine.  The only thing is to avoid duplicating in
decompression routine we need to convert CompressionId to Oid before
calling GetCompressionRoutine(), but now we can avoid sys cache lookup
for the built-in type.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Li Japin
Date:
Subject: Remove cache_plan argument comments to ri_PlanCheck
Next
From: Anastasia Lubennikova
Date:
Subject: Re: pgbench and timestamps (bounced)