Re: extensible external toast tuple support & snappy prototype - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: extensible external toast tuple support & snappy prototype
Msg-id: 20130605150144.GD28067@alap2.anarazel.de
In response to: Re: extensible external toast tuple support (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On 2013-05-31 23:42:51 -0400, Robert Haas wrote:
> > This should allow for fairly easy development of a new compression
> > scheme for out-of-line toast tuples. It will *not* work for compressed
> > inline tuples (i.e. VARATT_4B_C). I am not convinced that that is a
> > problem or that if it is, that it cannot be solved separately.
> Seems pretty sensible to me. The patch is obviously WIP but the
> direction seems fine to me.

So, I played a bit more with this, with an eye towards getting it into a
non-WIP state, but: while I still think the method for providing indirect
external Datum support is fine, I don't think my sketch for providing
extensible compression is.

As mentioned upthread, we also have compressed datums inline, as VARATT_4B_C
datums. The way toast_insert_or_update() works is that when it finds it needs
to shrink a Datum, it first tries to compress it *inline*, and only if the
result is still too big does the Datum get stored out of line. Changing that
doesn't sound like a good idea, since it a) would make an already complicated
function even more complicated and b) would likely make the whole thing
slower, since we would frequently compress with two different methods.

So I think for compressed tuples we need an independent trick that also works
for inline compressed tuples: The current way 4B_C datums work is that they
are basically a normal 4B Datum (but discernible by a different bit in the
non-length part of the length word). Such compressed Datums store the
uncompressed length of the Datum in their first 4 bytes. Since we cannot have
uncompressed Datums longer than 1GB due to varlena limitations, 2 bits in
that length word are free to discern different compression algorithms.

So what my (very much a prototype) patch does is use those two bits to
discern different compression algorithms. Currently it simply assumes that
'00' is pglz while '01' is snappy-c. That would leave us with two other
possible algorithms ('11' and '10'), but we could easily enough extend that
to more algorithms if we want, by not regarding the first 4 bytes as a length
word but as the compression algorithm indicator if both high bits are set.
(A rough, illustrative C sketch of that bit layout follows the COPY timings
below.)

So, before we go even more into details, here are some benchmark results
based on playing with a partial dump (1.1GB) of the public pg mailing list
archives (Thanks Alvaro!):

BEGIN;
SET toast_compression_algo = 0; -- pglz
CREATE TABLE messages ( ... );
\i ~/tmp/messages.sane.dump
Time: 43053.786 ms
ALTER TABLE messages RENAME TO messages_pglz;
COMMIT;

BEGIN;
SET toast_compression_algo = 1; -- snappy
CREATE TABLE messages ( ... );
\i ~/tmp/messages.sane.dump
Time: 30340.210 ms
ALTER TABLE messages RENAME TO messages_snappy;
COMMIT;

postgres=# \dt+
                       List of relations
 Schema |      Name       | Type  | Owner  |  Size  | Description
--------+-----------------+-------+--------+--------+-------------
 public | messages_pglz   | table | andres | 526 MB |
 public | messages_snappy | table | andres | 523 MB |

Ok, so while the data size didn't change all that much, compression was quite
noticeably faster. With snappy the most visible bottleneck is COPY itself,
not compression, although compression is still among the top three functions
in the profile...

So what about data reading?

postgres=# COPY messages_pglz TO '/dev/null' WITH BINARY;
COPY 86953
Time: 3825.241 ms
postgres=# COPY messages_snappy TO '/dev/null' WITH BINARY;
COPY 86953
Time: 3674.844 ms

Ok, so here the performance difference is relatively small. Turns out that's
because most of the time is spent in the output routines, even though we are
using BINARY mode; tsvector_send is expensive.
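Before the numbers for reading back just the compressed column, here is the
rough C sketch of the bit layout referenced above. To be clear, the struct,
macro, and constant names below are invented for illustration only -- they
are not the names used in postgres.h or in the attached prototype:

/*
 * Illustration only: a compressed inline (4B_C) datum stores the
 * uncompressed ("raw") size in its first 4 bytes.  Because varlena values
 * are capped at 1GB (2^30 bytes), only the low 30 bits are needed for that
 * size, leaving the top 2 bits free to carry a compression-algorithm id.
 * All names here are made up for this sketch.
 */
#include <stdint.h>

#define SKETCH_RAWSIZE_MASK   0x3FFFFFFFU  /* low 30 bits: uncompressed size */
#define SKETCH_METHOD_SHIFT   30           /* high 2 bits: algorithm id */

#define SKETCH_METHOD_PGLZ    0x0          /* '00' - existing default */
#define SKETCH_METHOD_SNAPPY  0x1          /* '01' - snappy-c */
/* '10' and '11' remain free, or could escape to a longer header */

/* Stand-in for the word at the start of a compressed datum's data area. */
typedef struct SketchCompressHeader
{
    uint32_t    info;           /* rawsize | (method << 30) */
    /* compressed payload follows */
} SketchCompressHeader;

static inline void
sketch_set_info(SketchCompressHeader *hdr, uint32_t rawsize, uint32_t method)
{
    hdr->info = (rawsize & SKETCH_RAWSIZE_MASK) | (method << SKETCH_METHOD_SHIFT);
}

static inline uint32_t
sketch_get_rawsize(const SketchCompressHeader *hdr)
{
    return hdr->info & SKETCH_RAWSIZE_MASK;
}

static inline uint32_t
sketch_get_method(const SketchCompressHeader *hdr)
{
    return hdr->info >> SKETCH_METHOD_SHIFT;
}

Presumably that is also why '00' maps to pglz: datums written by existing
code already have those two bits clear (their stored size is below 1GB), so
they keep being read back as pglz-compressed without any conversion.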
postgres=# COPY (SELECT rawtxt FROM messages_pglz) TO '/dev/null' WITH BINARY;
COPY 86953
Time: 2180.512 ms
postgres=# COPY (SELECT rawtxt FROM messages_snappy) TO '/dev/null' WITH BINARY;
COPY 86953
Time: 1782.810 ms

Ok, so here the benefits are already nicer. Imo this shows that using a
different compression algorithm is quite a good idea.

Important questions are:

1) Which algorithms do we want? I think snappy is a good candidate, but I
mostly chose it because it's BSD licensed, widely used, and has a C
implementation with a usable API. LZ4 might be another interesting choice.
Another, slower algorithm with a higher compression ratio would also be a
good idea for many applications.

2) Do we want to build infrastructure for more than 3 compression
algorithms? We could delay that decision until we add the 3rd. (A rough
sketch of how decompression could dispatch on the two algorithm bits is
appended below, after my signature.)

3) Surely choosing the compression algorithm via a GUC, ala SET
toast_compression_algo = ..., isn't the way to go. I'd say a storage
attribute is more appropriate?

4) The prototype removed knowledge about the internals of compression from
postgres.h, which imo is a good idea, but that is debatable.

5) E.g. snappy stores the uncompressed length internally as a varint, but I
don't think there's a way to benefit from that on little-endian machines,
since the high bits we use to discern from pglz are actually stored 4 bytes
in...

Two patches attached:
1) Add snappy to src/common. The integration needs some more work.
2) Combined patch that adds indirect tuple support and snappy compression.
Those could be separated, but this is an experiment so far...

Comments?

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
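PS: To make the dispatch question (2) above a bit more concrete, here is a
rough sketch of how a decompression entry point could branch on those two
bits. This is not the code from the attached patch; the constants and the two
helper declarations are invented for illustration (in a real build they would
wrap the actual pglz and snappy-c calls):

/*
 * Illustration only: dispatch decompression on the two algorithm bits
 * described earlier.  Same made-up mask/shift constants as in the sketch
 * above; the decompress_* helpers are hypothetical wrappers around the
 * real pglz and snappy-c library calls and are only declared here.
 */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_RAWSIZE_MASK   0x3FFFFFFFU
#define SKETCH_METHOD_SHIFT   30
#define SKETCH_METHOD_PGLZ    0x0   /* '00' */
#define SKETCH_METHOD_SNAPPY  0x1   /* '01' */

/* Hypothetical wrappers; bodies omitted in this sketch. */
extern void decompress_pglz(const char *src, size_t srclen,
                            char *dst, size_t rawsize);
extern void decompress_snappy(const char *src, size_t srclen,
                              char *dst, size_t rawsize);

/* 'datum' points at the compressed data area: 4-byte info word + payload. */
static char *
sketch_decompress(const char *datum, size_t datumlen)
{
    uint32_t    info;
    uint32_t    rawsize;
    char       *result;

    memcpy(&info, datum, sizeof(info));
    rawsize = info & SKETCH_RAWSIZE_MASK;
    result = malloc(rawsize);
    if (result == NULL)
        return NULL;

    switch (info >> SKETCH_METHOD_SHIFT)
    {
        case SKETCH_METHOD_PGLZ:
            decompress_pglz(datum + 4, datumlen - 4, result, rawsize);
            break;
        case SKETCH_METHOD_SNAPPY:
            decompress_snappy(datum + 4, datumlen - 4, result, rawsize);
            break;
        default:
            /* unknown algorithm id; the real code would raise an error */
            free(result);
            return NULL;
    }
    return result;
}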