Re: Proposal: custom compression methods - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Proposal: custom compression methods
Date
Msg-id 56715670.1000304@2ndquadrant.com
In response to Re: Proposal: custom compression methods  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
Hi,

On 12/14/2015 12:51 PM, Simon Riggs wrote:
> On 13 December 2015 at 17:28, Alexander Korotkov
> <a.korotkov@postgrespro.ru <mailto:a.korotkov@postgrespro.ru>> wrote:
>
>     it would be nice to make compression methods pluggable.
>
>
> Agreed.
>
> My thinking is that this should be combined with work to make use of
> the compressed data, which is why Alvaro, Tomas, David have been
> working on Col Store API for about 18 months and work on that
> continues with more submissions for 9.6 due.

I'm not sure it makes sense to combine those two uses of compression, 
because there are various differences - some subtle, some less subtle. 
It's a bit difficult to discuss this without any column store 
background, but I'll try anyway.

The compression methods discussed in this thread, used to compress a 
single varlena value, are "general-purpose" in the sense that they 
operate on an opaque stream of bytes, without any additional context 
(e.g. about the structure of the data being compressed). So essentially 
the methods have an API like this:
    int compress(char *src, int srclen, char *dst, int dstlen);
    int decompress(char *src, int srclen, char *dst, int dstlen);

And possibly some auxiliary methods like "estimate compressed length" 
and such.

OTOH the compression methods we're messing with while working on the 
column store are quite different - they operate on columns (i.e. "arrays 
of Datums"). Also, column stores prefer "light-weight" compression 
methods like RLE or DICT (dictionary compression), because those methods 
allow execution directly on the compressed data when done properly. That 
in turn requires additional info about the data type in the column, for 
example so that the RLE groups match the data type length.

So the API of those methods looks quite different, compared to the 
general-purpose methods. Not only will the compression/decompression 
methods have additional parameters with info about the data type, but 
there will also be methods for iterating over values in the compressed 
data, etc.

Of course, it'd be nice to have the ability to add/remove even those 
light-weight methods, but I'm not sure it makes sense to squash them 
into the same catalog. I can imagine a catalog suitable for both APIs 
(essentially having two groups of columns, one for each type of 
compression algorithm), but I can't really imagine a compression method 
providing both interfaces at the same time.

In any case, I don't think this is the main challenge the patch needs to 
solve at this point.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


