Re: Proposal: custom compression methods - Mailing list pgsql-hackers

From Craig Ringer
Subject Re: Proposal: custom compression methods
Msg-id CAMsr+YGiN7davH54QVyaMnpQJyuO_AkbVZe6s71U-qJmwbJt3w@mail.gmail.com
In response to Proposal: custom compression methods  (Alexander Korotkov <a.korotkov@postgrespro.ru>)
Responses Re: Proposal: custom compression methods  (Chapman Flack <chap@anastigmatix.net>)
Re: Proposal: custom compression methods  (Bill Moran <wmoran@potentialtech.com>)
Re: Proposal: custom compression methods  (Jim Nasby <Jim.Nasby@BlueTreble.com>)
Re: Proposal: custom compression methods  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On 14 December 2015 at 01:28, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
Hackers,

I'd like to propose a new feature: "Custom compression methods".

Are you aware of the past work in this area? There's quite a bit of history and I strongly advise you to read the relevant threads to make sure you don't run into the same problems.

See:


for at least one of the prior attempts.
 
Motivation

Currently, when a datum doesn't fit on the page, PostgreSQL tries to compress it using the PGLZ algorithm. Compression of particular attributes can be turned on/off by tuning the column's storage parameter. Also, there is a heuristic that a datum is not compressible when its first KB is not compressible. I can see the following reasons for improving this situation.

Yeah, recent discussion has made it clear that there's room for improving how and when TOAST compresses things. Per-attribute compression thresholds would make a lot of sense.

Therefore, it would be nice to make compression methods pluggable.

The most important issues to consider here are on-disk format stability, space overhead, and pg_upgrade-ability. It looks like you have addressed all of these below by making compression methods per-column rather than per-datum, and by forcing a full table rewrite to change a column's method.

The issue with per-datum compression is that TOAST claims two bits of the varlena header, which already limits us to 1 GiB varlena values, something people are starting to find to be a problem. There's no wiggle room to steal more bits. If you want pluggable compression, you need a way to store knowledge of how a given datum is compressed alongside the datum, or a fast, efficient way to look it up.

pg_upgrade means you can't just redefine the current TOAST bits so that the compressed bit means "data is compressed; check the first byte of the varlena data for the algorithm", because existing data won't have that tag: its first byte will be the start of the compressed data stream.

There's also the issue of what to do when the algorithm used for a datum is no longer loaded. I don't care much about that one; I'm happy to say "you ERROR and tell the user to fix the situation". But I think some people were concerned about that too, or about being stuck with algorithms forever once they're added.

Looks like you've dealt with all those concerns.


DROP COMPRESSION METHOD compname;

 
When you drop a compression method, what happens to data compressed with that method?

If you re-create it, can the data be associated with the re-created method?
 
The compression method of a column would be stored in the pg_attribute table.

So you can't change it without a full table rewrite, but you also don't have to poach any TOAST header bits to determine which algorithm is used. And you can use pg_depend to prevent dropping a compression method that's still in use by a table. Makes sense.
 
Looks promising, but I haven't re-read the old thread in detail to see if this approach was already considered and rejected.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
