Thread: Proposal: custom compression methods

Proposal: custom compression methods

From
Alexander Korotkov
Date:
Hackers,

I'd like to propose a new feature: "Custom compression methods".

Motivation

Currently, when a datum doesn't fit on the page, PostgreSQL tries to compress it using the PGLZ algorithm. Compression of particular attributes can be turned on/off by tuning the column's storage parameter. Also, there is a heuristic that a datum is not compressible when its first KB is not compressible. I can see the following reasons for improving this situation.

 * The heuristic used to detect compressible data may not be optimal. We have already run into this situation with jsonb.
 * For some data distributions there could be compression methods more effective than PGLZ. For example:
     * For natural languages we could use predefined dictionaries, which would allow us to compress even relatively short strings (ones not long enough for PGLZ to train its dictionary on).
     * For jsonb/hstore we could implement compression methods that keep a dictionary of keys. This could be either a static predefined dictionary or a dynamically extended dictionary with some backing storage.
     * For jsonb and other container types we could implement compression methods that allow extraction of particular fields without decompressing the full value.

Therefore, it would be nice to make compression methods pluggable.

Design

Compression methods would be stored in a new pg_compress system catalog table with the following structure:

compname        name
comptype        oid
compcompfunc    regproc
compdecompfunc  regproc
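
For illustration, here is a rough sketch of how such a catalog header might be declared, modeled on the existing pg_*.h files (the relation OID and all names below are placeholders, not part of the proposal):

CATALOG(pg_compress,9100)
{
    NameData    compname;        /* compression method name */
    Oid         comptype;        /* target type, or 0 for type-agnostic */
    regproc     compcompfunc;    /* compression function */
    regproc     compdecompfunc;  /* decompression function */
} FormData_pg_compress;

typedef FormData_pg_compress *Form_pg_compress;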

Compression methods could be created by the "CREATE COMPRESSION METHOD" command and deleted by the "DROP COMPRESSION METHOD" command.

CREATE COMPRESSION METHOD compname [FOR TYPE comptype_name]
    WITH COMPRESS FUNCTION compcompfunc_name
         DECOMPRESS FUNCTION compdecompfunc_name;
DROP COMPRESSION METHOD compname;

Signatures of compcompfunc and compdecompfunc would be similar to pglz_compress and pglz_decompress, except for the compression strategy. There is only one compression strategy in use for pglz (PGLZ_strategy_default), so I'm not sure it would be useful to provide multiple strategies for compression methods.

extern int32 compcompfunc(const char *source, int32 slen, char *dest);
extern int32 compdecompfunc(const char *source, int32 slen, char *dest, int32 rawsize);
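
For illustration only, here is a minimal sketch of an lz4-backed pair of functions with these signatures (the function names and the use of liblz4 are hypothetical, not part of the proposal). It assumes, like pglz, that the caller provides a destination buffer of at least slen bytes and that returning -1 means "store the value uncompressed":

#include <lz4.h>

typedef int int32;              /* provided by c.h in PostgreSQL proper */

int32
lz4_compcompfunc(const char *source, int32 slen, char *dest)
{
    int     clen;

    if (slen < 2)
        return -1;              /* nothing to gain on tiny inputs */

    /* accept the result only if it is strictly smaller than the input */
    clen = LZ4_compress_default(source, dest, slen, slen - 1);
    return (clen > 0) ? clen : -1;
}

int32
lz4_compdecompfunc(const char *source, int32 slen, char *dest, int32 rawsize)
{
    int     rlen = LZ4_decompress_safe(source, dest, slen, rawsize);

    return (rlen >= 0) ? rlen : -1;     /* -1 on corrupt input */
}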

A compression method could be type-agnostic (comptype = 0) or type-specific (comptype != 0). The default compression method is PGLZ.

The compression method of a column would be stored in the pg_attribute table. Dependencies between columns and compression methods would be tracked in pg_depend, preventing a compression method that is currently in use from being dropped. The compression method of an attribute could be altered with the ALTER TABLE command.

ALTER TABLE table_name ALTER COLUMN column_name SET COMPRESSION METHOD compname;

Since mixing different compression methods in the same attribute would be hard to manage (especially the dependency tracking), altering an attribute's compression method would require a table rewrite.

Implementation details

Catalog changes, new commands, dependency tracking etc. are mostly mechanical work with no fundamental problems. The hardest part seems to be providing seamless integration of custom compression methods into the existing code.

It doesn't seem hard to add an extra parameter with the compression method to toast_compress_datum. However, PG_DETOAST_DATUM has to call the custom decompress function knowing nothing but the datum itself. That means we should somehow conceal knowledge of the compression method inside the datum. One solution could be putting the compression method OID right after the varlena header. Putting this on disk would cause storage overhead and break backward compatibility, so instead we can add this OID right after reading the datum from the page. This could be the weakest point in the whole proposal, and I'll be very glad to hear better ideas.
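
To make the idea concrete, a sketch of the in-memory layout being described (the struct and field names are illustrative only, not taken from any existing header):

#include "postgres.h"

typedef struct CustomCompressedDatum
{
    int32       vl_len_;    /* varlena header: total size of this struct */
    int32       rawsize;    /* original (uncompressed) size, as pglz stores today */
    Oid         cmoid;      /* OID of the pg_compress entry used */
    char        data[FLEXIBLE_ARRAY_MEMBER];   /* compressed payload */
} CustomCompressedDatum;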

P.S. I'd like to thank Petr Korobeinikov <pkorobeinikov@gmail.com>, who started work on this patch and sent me a draft of the proposal in Russian.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Proposal: custom compression methods

From
Craig Ringer
Date:
On 14 December 2015 at 01:28, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
Hackers,

I'd like to propose a new feature: "Custom compression methods".

Are you aware of the past work in this area? There's quite a bit of history and I strongly advise you to read the relevant threads to make sure you don't run into the same problems.

See:

http://www.postgresql.org/message-id/flat/20130615102028.GK19500@alap2.anarazel.de#20130615102028.GK19500@alap2.anarazel.de

for at least one of the prior attempts.
 
Motivation

Currently, when a datum doesn't fit on the page, PostgreSQL tries to compress it using the PGLZ algorithm. Compression of particular attributes can be turned on/off by tuning the column's storage parameter. Also, there is a heuristic that a datum is not compressible when its first KB is not compressible. I can see the following reasons for improving this situation.

Yeah, recent discussion has made it clear that there's room for improving how and when TOAST compresses things. Per-attribute compression thresholds made a lot of sense.

Therefore, it would be nice to make compression methods pluggable.

Very important issues to consider here are on-disk format stability, space overhead, and pg_upgrade-ability. It looks like you have addressed all of these issues below by making compression methods per-column rather than per-Datum and forcing a full table rewrite to change it.

The issue with per-Datum is that TOAST claims two bits of a varlena header, which already limits us to 1 GiB varlena values, something people are starting to find to be a problem. There's no wiggle room to steal more bits. If you want pluggable compression you need a way to store knowledge of how a given datum is compressed with the datum or have a fast, efficient way to check.
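
To spell out the arithmetic (illustrative macro names; the real bit layout lives in postgres.h):

#define VARLENA_FLAG_BITS    2                              /* claimed by TOAST */
#define VARLENA_LENGTH_BITS  (32 - VARLENA_FLAG_BITS)       /* 30 bits of length */
#define VARLENA_MAX_LEN      (1U << VARLENA_LENGTH_BITS)    /* 2^30 = 1 GiB */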

pg_upgrade means you can't just redefine the current toast bits so the compressed bit means "data is compressed, check first byte of varlena data for algorithm" because existing data won't have that, the first byte will be the start of the compressed data stream.

There's also the issue of what you do when the algorithm used for a datum is no longer loaded. I don't care so much about that one; I'm happy to say "you ERROR and tell the user to fix the situation". But I think some people were concerned about that too, or about being stuck with algorithms forever once they're added.

Looks like you've dealt with all those concerns.


DROP COMPRESSION METHOD compname;

 
When you drop a compression method what happens to data compressed with that method?

If you re-create it can the data be associated with the re-created method?
 
Compression method of column would be stored in pg_attribute table.

So you can't change it without a full table rewrite, but then you also don't have to poach any TOAST header bits to determine which algorithm is used. And you can use pg_depend to prevent dropping a compression method still in use by a table. Makes sense.
 
Looks promising, but I haven't re-read the old thread in detail to see if this approach was already considered and rejected.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Proposal: custom compression methods

From
Chapman Flack
Date:
On 12/14/15 01:50, Craig Ringer wrote:

> pg_upgrade means you can't just redefine the current toast bits so the
> compressed bit means "data is compressed, check first byte of varlena data
> for algorithm" because existing data won't have that, the first byte will
> be the start of the compressed data stream.

Is there any small sequence of initial bytes you wouldn't ever see in PGLZ
output?  Either something invalid, or something obviously nonoptimal
like run(n,'A')||run(n,'A') where PGLZ would have just output run(2n,'A')?

-Chap



Re: Proposal: custom compression methods

From
Craig Ringer
Date:
On 14 December 2015 at 15:27, Chapman Flack <chap@anastigmatix.net> wrote:
On 12/14/15 01:50, Craig Ringer wrote:

> pg_upgrade means you can't just redefine the current toast bits so the
> compressed bit means "data is compressed, check first byte of varlena data
> for algorithm" because existing data won't have that, the first byte will
> be the start of the compressed data stream.

Is there any small sequence of initial bytes you wouldn't ever see in PGLZ
output?  Either something invalid, or something obviously nonoptimal
like run(n,'A')||run(n,'A') where PGLZ would have just output run(2n,'A')?


I don't think we need to worry, since doing it per-column makes this issue go away. Per-Datum compression would make it easier to switch methods (requiring no table rewrite) at the cost of more storage for each varlena, which probably isn't worth it anyway.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Proposal: custom compression methods

From
Bill Moran
Date:
On Mon, 14 Dec 2015 14:50:57 +0800
Craig Ringer <craig@2ndquadrant.com> wrote:

> On 14 December 2015 at 01:28, Alexander Korotkov <a.korotkov@postgrespro.ru>
> wrote:
> 
> > Hackers,
> >
> > I'd like to propose a new feature: "Custom compression methods".

I missed the initial post on this thread ...

Have you started or do you plan to actually start work on this? I've
already started some preliminary coding with this as one of its
goals, but it's been stalled for about a month due to life
intervening. I plan to come back to it soon, so if you decide to
start actually writing code, please contact me so we don't double
up on the work.

-- 
Bill Moran



Re: Proposal: custom compression methods

From
Simon Riggs
Date:
On 13 December 2015 at 17:28, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
 
it would be nice to make compression methods pluggable.

Agreed.

My thinking is that this should be combined with work to make use of the compressed data, which is why Alvaro, Tomas, David have been working on Col Store API for about 18 months and work on that continues with more submissions for 9.6 due.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Proposal: custom compression methods

From
Jim Nasby
Date:
On 12/14/15 12:50 AM, Craig Ringer wrote:
> The issue with per-Datum is that TOAST claims two bits of a varlena
> header, which already limits us to 1 GiB varlena values, something
> people are starting to find to be a problem. There's no wiggle room to
> steal more bits. If you want pluggable compression you need a way to
> store knowledge of how a given datum is compressed with the datum or
> have a fast, efficient way to check.

... unless we allowed for 8 byte varlena headers. Compression changes 
themselves certainly don't warrant that, but if people are already 
unhappy with 1GB TOAST then maybe that's enough.

The other thing this might buy us is a few bits that could be used to 
support Datum versioning for other purposes, such as when the binary 
format of something changes. I would think that at some point we'll need 
that for pg_upgrade.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Proposal: custom compression methods

From
Andres Freund
Date:
On 2015-12-14 14:50:57 +0800, Craig Ringer wrote:
> http://www.postgresql.org/message-id/flat/20130615102028.GK19500@alap2.anarazel.de#20130615102028.GK19500@alap2.anarazel.de

> The issue with per-Datum is that TOAST claims two bits of a varlena header,
> which already limits us to 1 GiB varlena values, something people are
> starting to find to be a problem. There's no wiggle room to steal more
> bits. If you want pluggable compression you need a way to store knowledge
> of how a given datum is compressed with the datum or have a fast, efficient
> way to check.
>
> pg_upgrade means you can't just redefine the current toast bits so the
> compressed bit means "data is compressed, check first byte of varlena data
> for algorithm" because existing data won't have that, the first byte will
> be the start of the compressed data stream.

I don't think there's an actual problem here. My old patch that you
referenced solves this.

Andres



Re: Proposal: custom compression methods

From
Tomas Vondra
Date:
Hi,

On 12/14/2015 12:51 PM, Simon Riggs wrote:
> On 13 December 2015 at 17:28, Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
>
>     it would be nice to make compression methods pluggable.
>
>
> Agreed.
>
> My thinking is that this should be combined with work to make use of
> the compressed data, which is why Alvaro, Tomas, David have been
> working on Col Store API for about 18 months and work on that
> continues with more submissions for 9.6 due.

I'm not sure it makes sense to combine those two uses of compression, 
because there are various differences - some subtle, some less subtle. 
It's a bit difficult to discuss this without any column store 
background, but I'll try anyway.

The compression methods discussed in this thread, used to compress a 
single varlena value, are "general-purpose" in the sense that they 
operate on an opaque stream of bytes, without any additional context (e.g. 
about the structure of the data being compressed). So essentially the 
methods have an API like this:
  int compress(char *src, int srclen, char *dst, int dstlen);
  int decompress(char *src, int srclen, char *dst, int dstlen);

And possibly some auxiliary methods like "estimate compressed length" 
and such.

OTOH the compression methods we're messing with while working on the 
column store are quite different - they operate on columns (i.e. "arrays 
of Datums"). Also, column stores prefer "light-weight" compression 
methods like RLE or DICT (dictionary compression) because those methods 
allow execution on compressed data when done properly. That, for example, 
requires additional info about the data type in the column, so that the 
RLE groups match the data type length.

So the API of those methods looks quite different, compared to the 
general-purpose methods. Not only will the compression/decompression methods 
have additional parameters with info about the data type, but there'll also 
be methods for iterating over the values in the compressed data etc.
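
To give a very rough idea (hypothetical names, not an actual proposal), the shape of such a type-aware, column-oriented API might look more like this than like the byte-stream API above:

#include "postgres.h"

typedef struct ColumnChunk ColumnChunk;                  /* opaque compressed chunk */
typedef struct ColumnChunkIterator ColumnChunkIterator;  /* opaque iterator state */

typedef struct ColumnCompressionRoutine
{
    /* compress n values of the given type into an opaque chunk */
    ColumnChunk *(*compress_column) (Datum *values, bool *nulls, int n,
                                     Oid typid, int16 typlen, bool typbyval);

    /* walk a chunk one value at a time, without decompressing it fully */
    ColumnChunkIterator *(*begin_iterate) (ColumnChunk *chunk);
    bool        (*iterate_next) (ColumnChunkIterator *iter,
                                 Datum *value, bool *isnull);
} ColumnCompressionRoutine;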

Of course, it'd be nice to have the ability to add/remove even those 
light-weight methods, but I'm not sure it makes sense to squash them 
into the same catalog. I can imagine a catalog suitable for both APIs 
(essentially having two groups of columns, one for each type of 
compression algorithm), but I can't really imagine a compression method 
providing both interfaces at the same time.

In any case, I don't think this is the main challenge the patch needs to 
solve at this point.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Proposal: custom compression methods

From
Tomas Vondra
Date:

On 12/14/2015 07:28 PM, Jim Nasby wrote:
> On 12/14/15 12:50 AM, Craig Ringer wrote:
>> The issue with per-Datum is that TOAST claims two bits of a varlena
>> header, which already limits us to 1 GiB varlena values, something
>> people are starting to find to be a problem. There's no wiggle room to
>> steal more bits. If you want pluggable compression you need a way to
>> store knowledge of how a given datum is compressed with the datum or
>> have a fast, efficient way to check.
>
> ... unless we allowed for 8 byte varlena headers. Compression changes
> themselves certainly don't warrant that, but if people are already
> unhappy with 1GB TOAST then maybe that's enough.
>
> The other thing this might buy us is a few bits that could be used to
> support Datum versioning for other purposes, such as when the binary
> format of something changes. I would think that at some point we'll need
> that for pg_upgrade.

While versioning or increasing the 1GB limit are interesting, they have 
nothing to do with this particular patch. (BTW I don't see how the 
versioning would work at varlena level - that's something that needs to 
be handled at data type level).

I think the only question we need to ask here is whether we want to 
allow mixed compression for a column. If no, we're OK with the current 
header. This is what the patch does, as it requires a rewrite after 
changing the compression method.

And we're not painting ourselves into a corner - if we decide to 
increase the varlena header size in the future, this patch does not make 
it any more complicated.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Proposal: custom compression methods

From
Tomas Vondra
Date:
Hi,

On 12/13/2015 06:28 PM, Alexander Korotkov wrote:
> The compression method of a column would be stored in the pg_attribute
> table. Dependencies between columns and compression methods would be tracked
> in pg_depend, preventing a compression method that is currently in use from
> being dropped. The compression method of an attribute could be altered with
> the ALTER TABLE command.
>
> ALTER TABLE table_name ALTER COLUMN column_name SET COMPRESSION METHOD
> compname;

Do you plan to make this available in CREATE TABLE? For example 
Greenplum allows specifying COMPRESSTYPE/COMPRESSLEVEL per column.

What about compression levels? Do you plan to allow tweaking them? 
Tracking them would require another column in pg_attribute, probably.

> Since mixing different compression methods in the same attribute
> would be hard to manage (especially the dependency tracking), altering
> an attribute's compression method would require a table rewrite.

I don't think the dependency tracking would be a big issue. The easiest 
thing we could do is simply track which columns used the compression type in 
the past, and scan them when removing it (the compression type).

I think the main obstacle to making this possible is the lack of free 
space in the varlena header / the need to add the ID of the compression method 
into the value.

FWIW I'd like to allow this (mixing compression types), but I don't 
think it's worth the complexity at this point. We can add that later, if 
it turns out to be a problem in practice (which it probably won't).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Proposal: custom compression methods

From
Andres Freund
Date:
On 2015-12-16 14:14:36 +0100, Tomas Vondra wrote:
> I think the main obstacle to making this possible is the lack of free space in
> the varlena header / the need to add the ID of the compression method into the
> value.
> 
> FWIW I'd like to allow this (mixing compression types), but I don't think
> it's worth the complexity at this point. We can add that later, if it turns
> out to be a problem in practice (which it probably won't).

Again: unless I'm missing something, that was solved in
www.postgresql.org/message-id/flat/20130615102028.GK19500@alap2.anarazel.de

Personally I think we should add lz4 and maybe one strongly compressing
algorithm and be done with it. Runtime extensibility will make this much
more complicated, and I doubt it's worth the complexity.



Re: Proposal: custom compression methods

From
Michael Paquier
Date:
On Wed, Dec 16, 2015 at 10:17 PM, Andres Freund <andres@anarazel.de> wrote:
> Again: unless I'm missing something, that was solved in
> www.postgresql.org/message-id/flat/20130615102028.GK19500@alap2.anarazel.de
>
> Personally I think we should add lz4 and maybe one strongly compressing
> algorithm and be done with it. Runtime extensibility will make this much
> more complicated, and I doubt it's worth the complexity.

+1. In the end we are going to reach the same conclusion as we did for FPW
compression when we discussed that: let's just switch to something
that is less CPU-consuming than pglz, and lz4 is a good candidate for
that.
-- 
Michael



Re: Proposal: custom compression methods

From
Jim Nasby
Date:
On 12/16/15 7:03 AM, Tomas Vondra wrote:
>
> While versioning or increasing the 1GB limit are interesting, they have
> nothing to do with this particular patch. (BTW I don't see how the
> versioning would work at varlena level - that's something that needs to
> be handled at data type level).

Right, but that's often going to be very hard to do and still support 
pg_upgrade. It's not like most types have spare bits lying around. 
Granted, this still means non-varlena types are screwed.

> I think the only question we need to ask here is whether we want to
> allow mixed compression for a column. If no, we're OK with the current
> header. This is what the patch does, as it requires a rewrite after
> changing the compression method.

I think that is related to the other items though: none of those other 
items (except maybe the 1G limit) seem to warrant dorking with varlena, 
but if there were 3 separate features that could make use of support for 
8 byte varlena then perhaps it's time to invest in that effort. 
Especially since IIRC we're currently out of bits, so if we wanted to 
change this we'd need to do it at least a release in advance so there 
was a version that would expand 4 byte varlena to 8 byte as needed.

> And we're not painting ourselves into a corner - if we decide to
> increase the varlena header size in the future, this patch does not make
> it any more complicated.

True.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com