Proposed patch to change TOAST compression strategy
From
Tom Lane
Date:
This proposed patch addresses some issues in TOAST compression strategy that
were discussed last year, but we felt it was too late in the 8.3 cycle to
change the code immediately.  See these threads for background:
http://archives.postgresql.org/pgsql-hackers/2007-08/msg00082.php
http://archives.postgresql.org/pgsql-general/2007-08/msg01129.php

Specifically, the patch:

* Reduces the minimum datum size to be considered for compression from
256 to 32 bytes, as suggested by Greg Stark.  (Greg later suggested
dropping the threshold clear to zero, but this is pointless since we
must save enough bytes to pay for the longer header.)

* Increases the required compression rate for compressed storage from
20% to 25%, again per Greg's suggestion.

* Replaces force_input_size (size above which compression is forced)
with a maximum size to be considered for compression.  It was agreed
that allowing large inputs to escape the minimum-compression-rate
requirement was not bright, and that indeed we'd rather have a knob
that acted in the other direction.  I set this value to 1MB for the
moment, but it could use some performance studies to tune it.

* Adds an early-failure path to the compressor as suggested by Jan:
if it's been unable to find even one compressible substring in the
first 1KB (parameterizable), assume we're looking at incompressible
input and give up.  (There are various ways this could be done, but
this way seems to add the least overhead to the compressor's inner
loop.)

* Improves the toasting heuristics so that when we have very large
fields with attstorage 'x' or 'e', we will push those out to toast
storage before considering inline compression of shorter fields.
This also responds to a suggestion of Greg's, though my original
proposal for a solution was a bit off base because it didn't fix
the problem for large 'e' fields.

There was some discussion in the earlier threads of exposing some
of the compression knobs to users, perhaps even on a per-column
basis.  I have not done anything about that here.  It seems to me
that if we are changing around the parameters, we'd better get some
experience and be sure we are happy with the design before we set
things in stone by providing user-visible knobs.

I have not done any performance testing of these changes --- does
anyone have specific test scenarios they'd like to try?

            regards, tom lane


Attachment

Re: Proposed patch to change TOAST compression strategy

From
Teodor Sigaev
Date:
> This proposed patch addresses some issues in TOAST compression strategy that
> I have not done any performance testing of these changes --- does
> anyone have specific test scenarios they'd like to try?

This is a very important change for text search, because of tsvector storage.
Headline and rank functions read a lot of tsvectors.  It seems to me that a
ranking test would be very telling: the rank function reads the whole tsvector
and returns only a small amount of data (just a number).

Another testing focus could be lossy indexes, such as an index over polygons.

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: Proposed patch to change TOAST compression strategy

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> * Adds an early-failure path to the compressor as suggested by Jan:
> if it's been unable to find even one compressible substring in the
> first 1KB (parameterizable), assume we're looking at incompressible
> input and give up.  (There are various ways this could be done, but
> this way seems to add the least overhead to the compressor's inner
> loop.)

I'm not sure how to test the rest of it, but this bit seems testable. I fear
this may be too conservative. Even nigh incompressible data will find a few
backreferences.

I'll try some tests and see.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's On-Demand Production Tuning

Re: Proposed patch to change TOAST compression strategy

From
Tom Lane
Date:
Teodor Sigaev <teodor@sigaev.ru> writes:
>> I have not done any performance testing of these changes --- does
>> anyone have specific test scenarios they'd like to try?

> This is a very important change for text search, because of tsvector storage.
> Headline and rank functions read a lot of tsvectors.  It seems to me that a
> ranking test would be very telling: the rank function reads the whole tsvector
> and returns only a small amount of data (just a number).

I have no test data sets with large tsvector fields.  If you do, could
you try it with this patch and see what the results are like?

            regards, tom lane

Re: Proposed patch to change TOAST compression strategy

From
Teodor Sigaev
Date:
Magnus, could you provide a dump of the tsvector column from the archives search?
I'll run a test with and without Tom's patch.

>>> I have not done any performance testing of these changes --- does
>>> anyone have specific test scenarios they'd like to try?


--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: Proposed patch to change TOAST compression strategy

From
Magnus Hagander
Date:
On Tue, Feb 19, 2008 at 06:46:24PM +0300, Teodor Sigaev wrote:
> Magnus, could you provide a dump of the tsvector column from the archives search?
> I'll run a test with and without Tom's patch.
>
> >>>I have not done any performance testing of these changes --- does
> >>>anyone have specific test scenarios they'd like to try?


Sure, no problem. Will send you off-list, running dump now.

//Magnus

Re: Proposed patch to change TOAST compression strategy

From
Jan Wieck
Date:
On 2/18/2008 5:33 AM, Gregory Stark wrote:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>
>> * Adds an early-failure path to the compressor as suggested by Jan:
>> if it's been unable to find even one compressible substring in the
>> first 1KB (parameterizable), assume we're looking at incompressible
>> input and give up.  (There are various ways this could be done, but
>> this way seems to add the least overhead to the compressor's inner
>> loop.)
>
> I'm not sure how to test the rest of it, but this bit seems testable. I fear
> this may be too conservative. Even nigh incompressible data will find a few
> backreferences.

One could argue that storing JPEGs in a bytea column without configuring
the column to skip compression is pilot error.  But I guess this happens
much more often than accidentally encountering data that has zero backrefs
within the first KB and changes pattern thereafter.  Do you have an
example of data like that?
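For completeness, the per-column knob Jan alludes to already exists: setting a column's storage to EXTERNAL pushes large values out to TOAST without attempting compression at all. A minimal example (the table and column names here are hypothetical):

```sql
-- Move large values out of line, but never try to compress them;
-- appropriate for already-compressed data such as JPEGs.
ALTER TABLE photos ALTER COLUMN image SET STORAGE EXTERNAL;
```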


Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


Re: Proposed patch to change TOAST compression strategy

From
Teodor Sigaev
Date:
Tsvector dump (taken by Magnus from the mail archives of the pgsql lists):
http://www.sigaev.ru/misc/tstest.sql.bz2

Query:
select sum(ts_rank( vector, 'asdfjkl' )) from tstest ;

ts_rank detoasts the value in any case, even if the tsvector doesn't contain
the needed lexemes.

The test ran on my notebook: Core2 Duo 1.8GHz, 2GB RAM, with the default
postgresql.conf.

8.4 without patch:
Time: 10883.368 ms

8.4 with patch (db was reinitialized):
Time: 9654.266 ms



--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: Proposed patch to change TOAST compression strategy

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>> * Adds an early-failure path to the compressor as suggested by Jan:
>> if it's been unable to find even one compressible substring in the
>> first 1KB (parameterizable), assume we're looking at incompressible
>> input and give up.  (There are various ways this could be done, but
>> this way seems to add the least overhead to the compressor's inner
>> loop.)

> I'm not sure how to test the rest of it, but this bit seems testable. I fear
> this may be too conservative. Even nigh incompressible data will find a few
> backreferences.
> I'll try some tests and see.

On the strength of Teodor's favorable test, I went ahead and applied
this patch as-is.  We can certainly tweak the compressor logic again
if you can show an improvement by changing it further.

            regards, tom lane