Re: [HACKERS] Custom compression methods - Mailing list pgsql-hackers

From Ildus Kurbangaliev
Subject Re: [HACKERS] Custom compression methods
Date 30 November 2017, 18:20:09
Msg-id 20171130182009.1b492eb2@wp.localdomain
In response to Re: [HACKERS] Custom compression methods  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
On Thu, 30 Nov 2017 00:30:37 +0100
Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

> While the results may look different for other datasets, my
> conclusion is that it's unlikely we'll find another general-purpose
> algorithm beating pglz (enough for people to switch to it, as they'll
> need to worry about testing, deployment of extensions etc).
> 
> That doesn't necessarily mean supporting custom compression algorithms
> is pointless, of course, but I think people will be much more
> interested in exploiting known features of the data (instead of
> treating the values as opaque arrays of bytes).
> 
> For example, I see the patch implements a special compression method
> for tsvector values (used in the tests), exploiting knowledge of its
> internal structure. I haven't tested if that is an improvement (either
> in compression/decompression speed or compression ratio), though.
> 
> I can imagine other interesting use cases - for example values in
> JSONB columns often use the same "schema" (keys, nesting, ...), so
> I can imagine building a "dictionary" of JSON keys for the whole
> column ...
> 
> Ildus, is this a use case you've been aiming for, or were you aiming
> to use the new API in a different way?

Thank you for such a good overview. I agree that pglz is pretty good as
a general compression method, and there's no point in changing it, at
least for now.

I see a few useful use cases for compression methods: special
compression methods for int[] and timestamp[] (for time series), and
yes, dictionaries for jsonb, for which I have already created an
extension (https://github.com/postgrespro/jsonbd). It's working and
giving promising results.
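
To give an idea of how I expect it to be used (the method and handler
names here are just for illustration, not the final interface of the
extension):

    CREATE EXTENSION jsonbd;
    CREATE COMPRESSION METHOD jsonbd HANDLER jsonbd_compression_handler;
    -- keys of the jsonb values in this column are replaced by short ids
    -- from a per-column dictionary maintained by the extension
    CREATE TABLE events (payload jsonb COMPRESSED jsonbd);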

> 
> I wonder if the patch can be improved to handle this use case better.
> For example, it requires knowledge of the actual data type, instead of
> treating it as opaque varlena / byte array. I see tsvector compression
> does that by checking typeid in the handler.
> 
> But that fails, for example, with this:
> 
>     db=# create domain x as tsvector;
>     CREATE DOMAIN
>     db=# create table t (a x compressed ts1);
>     ERROR:  unexpected type 28198672 for tsvector compression handler
> 
> which means it's a few bricks shy of properly supporting domains. But I
> wonder if this should be instead specified in CREATE COMPRESSION
> METHOD instead. I mean, something like
> 
>     CREATE COMPRESSION METHOD ts1 HANDLER tsvector_compression_handler
>     TYPE tsvector;
> 
> When the type is not specified, it applies to all varlena values.
> Otherwise only to that type. Also, why not allow setting the compression as
> the default method for a data type, e.g.
> 
>     CREATE COMPRESSION METHOD ts1 HANDLER tsvector_compression_handler
>     TYPE tsvector DEFAULT;
> 
> would automatically add 'COMPRESSED ts1' to all tsvector columns in
> new CREATE TABLE commands.

The initial version of the patch contained ALTER syntax that changed the
compression method for whole types, but I decided to remove that
functionality for now because the patch is already quite complex; it
could be added later as a separate patch.

The syntax was:

    ALTER TYPE <type> SET COMPRESSION <cm>;

Specifying the supported type for the compression method is a good idea.
Maybe the following syntax would be better?

    CREATE COMPRESSION METHOD ts1 FOR tsvector
        HANDLER tsvector_compression_handler;
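
A rough sketch of how I imagine the FOR clause working (assuming the
domain case from your example is resolved by looking through to the
base type):

    CREATE TABLE t1 (a tsvector COMPRESSED ts1);  -- ok
    CREATE TABLE t2 (a text COMPRESSED ts1);      -- error: ts1 is FOR tsvector
    CREATE DOMAIN x AS tsvector;
    CREATE TABLE t3 (a x COMPRESSED ts1);         -- ok, x resolves to tsvector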

> 
> BTW do you expect the tsvector compression to be generally useful, or
> is it meant to be used only by the tests? If generally useful,
> perhaps it should be created in pg_compression by default. If only
> for tests, maybe it should be implemented in an extension in contrib
> (thus also serving as an example of how to implement new methods).
> 
> I haven't thought about the JSONB use case very much, but I suppose
> that could be done using the configure/drop methods. I mean,
> allocating the dictionary somewhere (e.g. in a table created by an
> extension?). The configure method gets the Form_pg_attribute record,
> so that should be enough I guess.
> 
> But the patch is not testing those two methods at all, which seems
> like something that needs to be addressed before commit. I don't
> expect a full-fledged JSONB compression extension, but something
> simple that actually exercises those methods in a meaningful way.

I will move tsvector_compression_handler to a separate extension in the
next version. I added it mostly as an example, but it could also be used
to achieve better compression for tsvectors. Tests on the mailing lists
database ('archie' tables):

usual compression:

maillists=# select body_tsvector, subject_tsvector into t1 from messages;
SELECT 1114213
maillists=# select pg_size_pretty(pg_total_relation_size('t1'));
 pg_size_pretty
----------------
 1637 MB
(1 row)

tsvector_compression_handler:

maillists=# select pg_size_pretty(pg_total_relation_size('t2'));
 pg_size_pretty
----------------
 1521 MB
(1 row)

lz4:

maillists=# select pg_size_pretty(pg_total_relation_size('t3'));
 pg_size_pretty
----------------
 1487 MB
(1 row)

I'm not attached to tsvector_compression_handler; if there were some
example that could use all the features, tsvector_compression_handler
could be replaced with it. My extension for jsonb dictionaries is big
enough that I'm not ready to include it in the patch. I do see the use
of the 'drop' method, since there should be a way for an extension to
clean up its resources, but I don't see a simple enough usage for it in
the tests. Maybe dummy methods for 'drop' and 'configure' will be
enough for testing purposes.
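
Something like this could be enough to exercise both methods in the
regression tests (test_cm, its handler, and the option are hypothetical
names, just to sketch the idea):

    CREATE COMPRESSION METHOD test_cm HANDLER test_cm_handler;
    -- 'configure' is called here with the attribute and the WITH options
    CREATE TABLE t (a text COMPRESSED test_cm WITH (test_opt 'value'));
    -- 'drop' is called here, giving the extension a chance to clean up
    DROP TABLE t;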

> 
> Similarly for the compression options - we need to test that the WITH
> part is handled correctly (tsvector does not provide configure
> method).

I could add some options to tsvector_compression_handler, like options
that change pglz_compress parameters.
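
For example, something like this (the option names are made up for
illustration, roughly mirroring the fields of PGLZ_Strategy):

    CREATE TABLE t (
        a tsvector COMPRESSED ts1 WITH (min_comp_rate '25',
                                        first_success_by '2048')
    );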

> 
> Which reminds me I'm confused by pg_compression_opt. Consider this:
> 
>     CREATE COMPRESSION METHOD ts1 HANDLER tsvector_compression_handler;
>     CREATE TABLE t (a tsvector COMPRESSED ts1);
> 
>     db=# select * from pg_compression_opt ;
>      cmoptoid | cmname |          cmhandler           | cmoptions
>     ----------+--------+------------------------------+-----------
>      28198689 | ts1    | tsvector_compression_handler |
>     (1 row)
> 
>     DROP TABLE t;
> 
>     db=# select * from pg_compression_opt ;
>      cmoptoid | cmname |          cmhandler           | cmoptions
>     ----------+--------+------------------------------+-----------
>      28198689 | ts1    | tsvector_compression_handler |
>     (1 row)
> 
>     db=# DROP COMPRESSION METHOD ts1;
>     ERROR:  cannot drop compression method ts1 because other objects
>             depend on it
>     DETAIL:  compression options for ts1 depends on compression method
>              ts1
>     HINT:  Use DROP ... CASCADE to drop the dependent objects too.
> 
> I believe the pg_compression_opt is actually linked to pg_attribute,
> in which case it should include (attrelid,attnum), and should be
> dropped when the table is dropped.
> 
> I suppose it was done this way to work around the lack of
> recompression (i.e. the compressed value might have ended in other
> table), but that is no longer true.

Good point. Since there is recompression now, the options can be safely
removed when the table is dropped. It will complicate pg_upgrade, but I
think this is solvable.
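
After that change I would expect your session above to behave like this
(a sketch of the intended behaviour, not tested):

    CREATE TABLE t (a tsvector COMPRESSED ts1);
    DROP TABLE t;
    -- the options row is tied to (attrelid, attnum) and goes away
    -- together with the table
    SELECT * FROM pg_compression_opt;  -- no rows left for t
    DROP COMPRESSION METHOD ts1;       -- now succeeds without CASCADE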

> 
> A few more comments:
> 
> 1) The patch makes optionListToArray (in foreigncmds.c) non-static,
> but it's not used anywhere. So this seems like a change that is no
> longer necessary.

I use this function in CreateCompressionOptions.

> 
> 2) I see we add two un-reserved keywords in gram.y - COMPRESSION and
> COMPRESSED. Perhaps COMPRESSION would be enough? I mean, we could do
> 
>     CREATE TABLE t (c TEXT COMPRESSION cm1);
>     ALTER ... SET COMPRESSION name ...
>     ALTER ... SET COMPRESSION none;
> 
> Although I agree the "SET COMPRESSION none" is a bit strange.

I agree; I've already changed the syntax for the next version of the
patch. It's COMPRESSION instead of COMPRESSED, and DROP COMPRESSION
instead of SET NOT COMPRESSED. That removes one keyword from the
grammar, and it looks nicer.
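
With the new grammar the examples look like this (the exact ALTER TABLE
spelling may still change):

    CREATE TABLE t (a tsvector COMPRESSION ts1);
    ALTER TABLE t ALTER COLUMN a SET COMPRESSION ts1;
    ALTER TABLE t ALTER COLUMN a DROP COMPRESSION;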

> 
> 3) heap_prepare_insert uses this chunk of code
> 
> +  else if (HeapTupleHasExternal(tup)
> +    || RelationGetDescr(relation)->tdflags & TD_ATTR_CUSTOM_COMPRESSED
> +    || HeapTupleHasCustomCompressed(tup)
> +    || tup->t_len > TOAST_TUPLE_THRESHOLD)
> 
> Shouldn't that rather be
> 
> +  else if (HeapTupleHasExternal(tup)
> +    || (RelationGetDescr(relation)->tdflags & TD_ATTR_CUSTOM_COMPRESSED
> +        && HeapTupleHasCustomCompressed(tup))
> +    || tup->t_len > TOAST_TUPLE_THRESHOLD)

These conditions are used for opposite directions.
HeapTupleHasCustomCompressed(tup) is true if the tuple contains
compressed datums that we need to decompress first, while the
TD_ATTR_CUSTOM_COMPRESSED flag means that the relation we're putting
the data into has columns with custom compression, so we may need to
compress datums in the current tuple.

-- 
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

