Re: [HACKERS] Custom compression methods - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: [HACKERS] Custom compression methods
Date
Msg-id CAPpHfdtxo7VYd1hrrXAMkBZOk-ZPNhH=SFaJwaGedT5sZ34VrA@mail.gmail.com
In response to Re: [HACKERS] Custom compression methods  (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
Responses Re: [HACKERS] Custom compression methods
List pgsql-hackers
On Mon, Apr 23, 2018 at 12:40 PM, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
On 22.04.2018 16:21, Alexander Korotkov wrote:
On Fri, Apr 20, 2018 at 7:45 PM, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
On 30.03.2018 19:50, Ildus Kurbangaliev wrote:
On Mon, 26 Mar 2018 20:38:25 +0300
Ildus Kurbangaliev <i.kurbangaliev@postgrespro.ru> wrote:

Attached rebased version of the patch. Fixed conflicts in pg_class.h.

New rebased version due to conflicts in master. Also fixed a few errors
and removed the cmdrop method since it couldn't be tested.

It seems useful (and not so difficult) to use custom compression methods for WAL compression as well: replace the direct calls of pglz_compress in xloginsert.c.

I'm going to object to this point, and I have the following arguments for that:

1) WAL compression is much more critical for durability than datatype
compression.  Imagine a compression algorithm contains a bug which
causes the decompress method to segfault.  In the case of datatype
compression, that would cause a crash only on access to the particular
value which triggers the segfault; the rest of the database would keep
working, giving you a chance to localize the issue and investigate it.
In the case of WAL compression, recovery would crash the server.  That
seems to be a much more serious disaster.  You wouldn't be able to get
your database up and running, and the same would happen on the standby.

Well, I do not think that somebody will try to implement their own compression algorithm...

But that is the main goal of this patch: to let somebody implement their own
compression algorithm which best fits a particular dataset.
 
From my point of view the main value of this patch is that it allows replacing the pglz algorithm with a more efficient one, for example zstd.
On some data sets zstd provides a more than 10 times better compression ratio and at the same time is faster than pglz.

Not exactly.  If we want to replace pglz with a more efficient one, then we should
just replace pglz with a better algorithm.  Pluggable compression methods are
definitely not worth it just for replacing pglz with zstd.
 
I do not think that the risk of data corruption caused by WAL compression with some alternative compression algorithm (zlib, zstd,...) is higher than when using the built-in Postgres compression.

If we're speaking about zlib or zstd, then yes, the risk of corruption is very low.  But again,
switching to zlib or zstd doesn't justify this patch.

2) The idea of custom compression methods is that some columns may
have a specific data distribution, which could be handled better with
a particular compression method and particular parameters.  In
WAL compression you're dealing with the whole WAL stream containing
all the values from the database cluster.  Moreover, if custom compression
methods are defined for columns, then in the WAL stream you have values
already compressed in the most efficient way.  However, it might
turn out that some compression method is better for WAL in the general
case (there are benchmarks showing our pglz is not very good in
comparison to the alternatives).  But in this case I would prefer to just
switch our WAL to a different compression method one day.  Thankfully
we don't preserve WAL compatibility between major releases.

Frankly speaking, I do not believe that somebody will use custom compression in this way: implementing their own compression methods for a specific data type.
Maybe just for json/jsonb, but even then only if the custom compression API allows storing the compression dictionary separately (which, as far as I understand, is not currently supported).

When I worked for SciDB (a database for scientists which has to deal mostly with multidimensional arrays of data) our first intention was to implement custom compression methods for particular data types and data distributions. For example, there are very fast, simple and efficient algorithms for encoding sequences of monotonically increasing integers, ....
But after several experiments we rejected this idea and switched to using generic compression methods. Mostly because we did not want the compressor to know much about page layout, data type representation,... In Postgres, from my point of view, we have a similar situation. Assume that we have a column of serial type. So it is a good candidate for compression, isn't it?

No, it's not.  Exactly because the compressor shouldn't deal with page layout etc.
But it's absolutely OK for a datatype compressor to deal with a particular type
representation.
 
But this approach deals only with particular attribute values. It cannot take any advantage of the fact that this particular column is monotonically increasing. That can be done only with page-level compression, but that is a different story.

Yes, compression of data series spread across multiple rows is a different story.
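
To make the distinction concrete, here is a minimal Python sketch (not part of the patch; the function names and the fixed 8-byte integer packing are assumptions chosen for illustration). It delta-encodes a serial-like sequence before handing it to zlib, which only works when the compressor sees many rows at once. A per-value compressor sees a single integer and has nothing to exploit.

```python
import struct
import zlib


def delta_encode(values):
    """Replace each value with its difference from the previous one."""
    deltas = []
    prev = 0
    for v in values:
        deltas.append(v - prev)
        prev = v
    return deltas


def delta_decode(deltas):
    """Invert delta_encode by accumulating a running sum."""
    values = []
    total = 0
    for d in deltas:
        total += d
        values.append(total)
    return values


def pack(values):
    """Serialize integers as fixed-width 8-byte little-endian values."""
    return struct.pack("<%dq" % len(values), *values)


# A serial-like column: 10,000 consecutive identifiers (illustrative data).
ids = list(range(1_000_000, 1_010_000))

raw_size = len(zlib.compress(pack(ids)))
delta_size = len(zlib.compress(pack(delta_encode(ids))))

assert delta_decode(delta_encode(ids)) == ids
print("zlib over raw values:  ", raw_size, "bytes")
print("zlib over delta values:", delta_size, "bytes")
```

The delta stream is almost entirely the value 1, so the generic compressor reduces it to a much smaller output than the raw values. None of this is available when each attribute value is compressed on its own.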
 
So the current approach works only for blob-like types: text, json,...  But they usually have quite a complex internal structure, and for them universal compression algorithms tend to be more efficient than any hand-written specific implementation. Also, algorithms like zstd are able to efficiently recognize and compress many common data distributions, like monotonic sequences, duplicates, repeated series,...

Values of some blob-like datatypes might not be long enough to let generic
compression algorithms like zlib or zstd train a dictionary.  For example,
MySQL successfully utilizes column-level dictionaries for JSON [1].  Also
JSON(B) might utilize some compression which lets the user extract
particular attributes without decompressing the whole document.
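
As a rough illustration of why a column-level dictionary helps with short values, here is a minimal Python sketch using zlib's preset-dictionary support (the `zdict` argument). The dictionary contents, the sample value and the helper names are hypothetical; this is neither what the patch implements nor how [1] does it.

```python
import zlib

# Hypothetical per-column dictionary: fragments that recur in most values.
COLUMN_DICT = b'{"user_id": , "status": "active", "country": "", "tags": []}'


def compress_with_dict(data, zdict):
    """Compress one value, priming the compressor with a shared dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()


def decompress_with_dict(blob, zdict):
    """Decompress a value; the same dictionary must be supplied again."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


# A single short JSON value, too small for zlib to learn anything from.
value = b'{"user_id": 42, "status": "active", "country": "NL", "tags": []}'

plain = zlib.compress(value, 9)
with_dict = compress_with_dict(value, COLUMN_DICT)

assert decompress_with_dict(with_dict, COLUMN_DICT) == value
print("original:", len(value), "plain zlib:", len(plain), "with dict:", len(with_dict))
```

The dictionary has to be stored somewhere outside the individual values and kept available for decompression, which is the kind of per-column metadata a custom compression method would need to manage.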

3) This patch provides custom compression methods recorded in
the catalog.  During recovery you don't have access to the system
catalog, because it's not recovered yet, and so you can't fetch compression
method metadata from there.  One possibility is to have a GUC
which stores the shared module and function names for WAL compression.
But that seems like quite a different mechanism from the one present
in this patch.

I do not think that assigning the default compression method through a GUC is such a bad idea.

It's probably not so bad, but it's a different story.  Unrelated to this patch, I think. 

Taking into account all of the above, I think we should give up on a custom
WAL compression method.  Or, at least, consider it unrelated to this
patch.

Sorry for repeating the same thing, but from my point of view the main advantage of this patch is that it allows replacing pglz with more efficient compression algorithms.
I do not see much sense in specifying a custom compression method for some particular columns.

This patch is about giving the user the ability to select a particular compression
method and its parameters for a particular column.
 
From my point of view it would be more useful to include in this patch an implementation of the compression API not only for pglz, but also for zlib, zstd and maybe some other popular compression libraries which have proved their efficiency.

Postgres already has a zlib dependency (unless explicitly excluded with --without-zlib), so a zlib implementation can be included in the Postgres build.
Other implementations can be left as modules which customers can build themselves. That is certainly less convenient than using preexisting stuff, but much more convenient than making users write this code themselves.

There is yet another aspect which is not covered by this patch: streaming compression.
Streaming compression is needed if we want to compress libpq traffic. It can be very efficient for the COPY command and for replication. Also, libpq compression can improve the speed of queries returning large results (for example containing JSON columns) over a slow network.
I have proposed such a patch for libpq, which uses either the zlib or the zstd streaming API. The Postgres built-in compression implementation doesn't have a streaming API at all, so it cannot be used here. Certainly, support for streaming may significantly complicate the compression API, so I am not sure that it actually needs to be included in this patch.
But I will be pleased if Ildus can consider this idea.
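
For readers unfamiliar with the term, the following Python sketch shows what a streaming API buys over compressing each message independently. It uses zlib's `compressobj`/`decompressobj` with `Z_SYNC_FLUSH`; the `send`/`receive` helpers and the sample rows are hypothetical stand-ins and are not taken from the libpq patch.

```python
import zlib

# One compressor/decompressor pair lives for the whole connection,
# so later messages can reference data seen in earlier ones.
sender = zlib.compressobj()
receiver = zlib.decompressobj()


def send(message):
    """Compress one message and flush so the peer can decode it right away.

    Z_SYNC_FLUSH emits all pending output but keeps the compression history.
    """
    return sender.compress(message) + sender.flush(zlib.Z_SYNC_FLUSH)


def receive(chunk):
    """Decompress one flushed chunk using the connection-wide history."""
    return receiver.decompress(chunk)


# Illustrative result rows that share most of their structure.
rows = [b'{"id": %d, "status": "active"}\n' % i for i in range(1000)]

wire_bytes = 0
for row in rows:
    chunk = send(row)
    wire_bytes += len(chunk)
    assert receive(chunk) == row

# Baseline: every row compressed on its own, with no shared history.
independent_bytes = sum(len(zlib.compress(row)) for row in rows)

print("streamed:   ", wire_bytes, "bytes")
print("independent:", independent_bytes, "bytes")
```

A block-oriented interface that compresses one buffer at a time with no state carried over corresponds to the "independent" baseline here, which is why a compressor without a streaming API is a poor fit for protocol traffic.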

I think streaming compression is a completely different story.
Client-server traffic compression is not just a server feature.  It must
also be supported on the client side.  And I really doubt it should be
pluggable.

In my opinion, you propose good things like compression of WAL
with a better algorithm and compression of client-server traffic.
But I think those features are unrelated to this patch and should
be considered separately.  They are not features which should be
added to this patch.  Regarding this patch, the points you provided
seem more like criticism of the general idea.

I think the problem of this patch is that it lacks a good example.
It would be nice if Ildus implemented simple compression with a
column-defined dictionary (like [1] does), and showed its efficiency
on real-life examples, which can't be achieved with generic
compression methods (like zlib or zstd).  That would be a good
answer to the criticism you provide.

Links


------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

 
