Re: extensible external toast tuple support & snappy prototype - Mailing list pgsql-hackers

From Andres Freund
Subject Re: extensible external toast tuple support & snappy prototype
Date
Msg-id 20130605150144.GD28067@alap2.anarazel.de
In response to Re: extensible external toast tuple support  (Robert Haas <robertmhaas@gmail.com>)
On 2013-05-31 23:42:51 -0400, Robert Haas wrote:
> > This should allow for fairly easy development of a new compression
> > scheme for out-of-line toast tuples. It will *not* work for compressed
> > inline tuples (i.e. VARATT_4B_C). I am not convinced that that is a
> > problem or that if it is, that it cannot be solved separately.

> Seems pretty sensible to me.  The patch is obviously WIP but the
> direction seems fine to me.

So, I played a bit more with this, with an eye towards getting it into a
non-WIP state, but: while I still think the method for providing
indirect external Datum support is fine, I don't think my sketch for
providing extensible compression is.

As mentioned upthread we also have compressed datums inline as
VARATT_4B_C datums. The way toast_insert_or_update() works is that when
it finds it needs to shrink a Datum it first tries to compress it
*inline*, and only if the result is still too big does it get stored
out of line. Changing that doesn't sound like a good idea since it
a) would make an already complicated function even more complicated and
b) would likely make the whole thing slower since we would frequently
compress with two different methods.
So I think for compressed tuples we need an independent trick that also
works for inline compressed tuples:

The current way 4B_C datums work is that they are basically a normal 4B
Datum (but discernible by a different bit in the non-length part of the
length word). Such compressed Datums store the uncompressed length of
the Datum in their first 4 bytes. Since we cannot have uncompressed
Datums longer than 1GB due to varlena limitations, 2 bits in that length
word are free.
So what my (very much prototype) patch does is to use those two bits to
discern different compression algorithms. Currently it simply assumes
that '00' is pglz while '01' is snappy-c. That would leave us with two
other possible algorithms ('10' and '11'), but we could easily enough
extend that to more algorithms if we want by not regarding the first 4
bytes as a length word but as a compression algorithm indicator whenever
the two high bits are set.
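
To make the layout concrete, something along these lines; the macro
names are invented for illustration and do not match the prototype
patch:

#include <stdint.h>

/* First 4 bytes of a compressed (4B_C) datum: raw size is always < 1GB,
 * so the two high bits are free to tag the compression algorithm. */
#define TOAST_COMPRESS_ALGO_MASK    UINT32_C(0xC0000000)  /* two high bits */
#define TOAST_COMPRESS_RAWSIZE_MASK UINT32_C(0x3FFFFFFF)  /* lower 30 bits */

#define TOAST_COMPRESS_ALGO_PGLZ    0u          /* '00' */
#define TOAST_COMPRESS_ALGO_SNAPPY  1u          /* '01' */
/* '10' is still free; '11' could mark an extended-algorithm header */

#define TOAST_COMPRESS_GET_ALGO(hdr) \
    (((uint32_t) (hdr) & TOAST_COMPRESS_ALGO_MASK) >> 30)
#define TOAST_COMPRESS_GET_RAWSIZE(hdr) \
    ((uint32_t) (hdr) & TOAST_COMPRESS_RAWSIZE_MASK)
#define TOAST_COMPRESS_MAKE_HEADER(algo, rawsize) \
    ((((uint32_t) (algo)) << 30) | ((uint32_t) (rawsize)))

Mapping '00' to pglz also means existing on-disk compressed datums,
whose high bits are already zero, keep decompressing unchanged.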

So, before we go into even more detail, here are some benchmark results
based on playing with a partial dump (1.1GB) of the public pg mailing
list archives (Thanks Alvaro!):

BEGIN;
SET toast_compression_algo = 0; -- pglz
CREATE TABLE messages ( ... );
\i ~/tmp/messages.sane.dump
Time: 43053.786 ms
ALTER TABLE messages RENAME TO messages_pglz;
COMMIT;

BEGIN;
SET toast_compression_algo = 1; -- snappy
CREATE TABLE messages ( ... );
\i ~/tmp/messages.sane.dump
Time: 30340.210 ms
ALTER TABLE messages RENAME TO messages_snappy;
COMMIT;

postgres=# \dt+
                        List of relations
 Schema |      Name       | Type  | Owner  |  Size  | Description
--------+-----------------+-------+--------+--------+-------------
 public | messages_pglz   | table | andres | 526 MB |
 public | messages_snappy | table | andres | 523 MB |

Ok, so while the data size didn't change all that much, the compression
was quite noticeably faster. With snappy the most visible bottleneck is
COPY, not compression, although compression is still in the top 3
functions...

So what about data reading?

postgres=# COPY messages_pglz TO '/dev/null' WITH BINARY;
COPY 86953
Time: 3825.241 ms
postgres=# COPY messages_snappy TO '/dev/null' WITH BINARY;
COPY 86953
Time: 3674.844 ms

Ok, so here the performance difference is relatively small. Turns out
that's because most of the time is spent in the output routines, even
though we are using BINARY mode. tsvector_send is expensive.

postgres=# COPY (SELECT rawtxt FROM messages_pglz) TO '/dev/null' WITH BINARY;
COPY 86953
Time: 2180.512 ms
postgres=# COPY (SELECT rawtxt FROM messages_snappy) TO '/dev/null' WITH BINARY;
COPY 86953
Time: 1782.810 ms

Ok, so here the benefits are already more noticeable.

Imo this shows that using a different compression algorithm is quite a
good idea.

Important questions are:
1) Which algorithms do we want? I think snappy is a good candidate, but
I mostly chose it because it's BSD licensed, widely used, and has a C
implementation with a usable API. LZ4 might be another interesting
choice. A slower algorithm with a higher compression ratio would also
be a good idea for many applications.
2) Do we want to build infrastructure for more than 3 compression
algorithms? We could delay that decision until we add the 3rd; see the
dispatch sketch after this list for how the bit scheme could be
extended.
3) Surely choosing the compression algorithm via a GUC a la SET
toast_compression_algo = ... isn't the way to go. I'd say a storage
attribute is more appropriate?
4) The prototype removed knowledge about the internals of compression
from postgres.h, which imo is a good idea, but that is debatable.
5) E.g. snappy stores the uncompressed length internally as a varint,
but I don't think there's a way to benefit from that on little-endian
machines since the high bits we use to distinguish it from pglz are
actually stored 4 bytes in...
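
As a follow-up to question 2, here is a sketch of how the decompression
side could dispatch on those two bits and still leave room for an
extended scheme. Everything here is illustrative: the layout matches the
macro sketch above, and decompress_pglz, decompress_snappy and
decompress_extended are placeholders, not real functions.

#include <stdint.h>
#include <stddef.h>

/* Placeholder per-algorithm decompressors, not real functions. */
extern void decompress_pglz(const char *src, size_t srclen,
                            char *dst, size_t rawsize);
extern void decompress_snappy(const char *src, size_t srclen,
                              char *dst, size_t rawsize);
extern void decompress_extended(const char *src, size_t srclen, char *dst);

void
toast_decompress(uint32_t header, const char *payload, size_t payload_len,
                 char *dst)
{
    uint32_t    algo = header >> 30;            /* two high bits */
    uint32_t    rawsize = header & 0x3FFFFFFF;  /* lower 30 bits */

    switch (algo)
    {
        case 0:     /* '00': pglz, same as all existing on-disk data */
            decompress_pglz(payload, payload_len, dst, rawsize);
            break;
        case 1:     /* '01': snappy-c */
            decompress_snappy(payload, payload_len, dst, rawsize);
            break;
        case 3:     /* '11': header is an algorithm indicator; the real
                     * length would have to be stored by the algorithm
                     * itself */
            decompress_extended(payload, payload_len, dst);
            break;
        default:    /* '10': unused so far */
            break;
    }
}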

Two patches attached:
1) add snappy to src/common. The integration needs some more work.
2) Combined patch that adds indirect tuple support and snappy
compression. Those could be separated, but this is an experiment so
far...

Comments?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
