Thread: pg_lzcompress strategy parameters

pg_lzcompress strategy parameters

From
Tom Lane
Date:
Greg complained here
http://archives.postgresql.org/pgsql-patches/2007-07/msg00342.php
that the default strategy parameters used by the TOAST compressor
might need some adjustment.  After thinking about it a little I wonder
whether they're not even more broken than that.  The present behavior
is:

1. Never compress for inputs < min_input_size (256 bytes by default).
2. Compress inputs >= force_input_size (6K by default), as long as
   compression produces a result at least 1 byte smaller than the input.

3. For inputs between min_input_size and force_input_size, compress only
   if compression of at least min_comp_rate percent is achieved
   (20% by default).

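Roughly, those three rules amount to something like this (just a
sketch; the names below are made up for illustration and are not the
actual pg_lzcompress source):

#include <stdbool.h>
#include <stddef.h>

#define MIN_INPUT_SIZE    256           /* rule 1 */
#define FORCE_INPUT_SIZE  (6 * 1024)    /* rule 2 */
#define MIN_COMP_RATE     20            /* rule 3, in percent */

bool
store_compressed_current(size_t rawsize, size_t compressed_size)
{
    if (rawsize < MIN_INPUT_SIZE)
        return false;                       /* rule 1: never compress */
    if (rawsize >= FORCE_INPUT_SIZE)
        return compressed_size < rawsize;   /* rule 2: any saving at all */
    /* rule 3: demand at least MIN_COMP_RATE percent savings */
    return compressed_size <= rawsize - (rawsize * MIN_COMP_RATE) / 100;
}
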
This whole structure seems a bit broken, independently of whether the
particular parameter values are good.  If the compressor is given an
input of 1000000 bytes and manages to compress it to 999999 bytes,
we'll store it compressed, and pay for decompression cycles on every
access, even though the I/O savings are nonexistent.  That's not sane.

I'm inclined to think that the concept of force_input_size is wrong.
Instead I suggest that we have a min_comp_rate (minimum percentage
savings) and a min_savings (minimum absolute savings), and compress
if either one is met.  For instance, with min_comp_rate = 10% and
min_savings = 1MB, then for inputs below 10MB you'd require at least
10% savings to compress them, but for inputs above 10MB you'd require
at least 1MB saved to compress.
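
In code, the proposed test would be something along these lines (again
only a sketch with made-up names, not a patch):

#include <stdbool.h>
#include <stddef.h>

bool
store_compressed_proposed(size_t rawsize, size_t compressed_size,
                          int min_comp_rate, size_t min_savings)
{
    size_t      saved;

    if (compressed_size >= rawsize)
        return false;               /* no savings at all, never store */
    saved = rawsize - compressed_size;

    /*
     * With min_comp_rate = 10 and min_savings = 1MB, the percentage
     * test is the easier one to meet below 10MB, and the absolute
     * test is the easier one above 10MB.
     */
    return saved >= (rawsize * min_comp_rate) / 100 || saved >= min_savings;
}

With a min_comp_rate alone, as in the second idea below, the min_savings
term would simply drop out.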

Or maybe it should just be a min_comp_rate and nothing else.
Compressing a 1GB field to 999MB is probably not very sane either.

This is all independent of what the specific parameter settings should
be, but I concur with Greg that those could do with a fresh look.

Thoughts?
        regards, tom lane


Re: pg_lzcompress strategy parameters

From
"Joshua D. Drake"
Date:

Tom Lane wrote:
> 
> I'm inclined to think that the concept of force_input_size is wrong.
> Instead I suggest that we have a min_comp_rate (minimum percentage
> savings) and a min_savings (minimum absolute savings), and compress
> if either one is met.  For instance, with min_comp_rate = 10% and
> min_savings = 1MB, then for inputs below 10MB you'd require at least
> 10% savings to compress them, but for inputs above 10MB you'd require
> at least 1MB saved to compress.

I would agree with the above, and would even suggest the ability to
set this as a GUC or per table. I may be willing to pay a very heavy
cost if I knew that the data would only be accessed intermittently.

Joshua D. Drake

--
     === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/



Re: pg_lzcompress strategy parameters

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> This whole structure seems a bit broken, independently of whether the
> particular parameter values are good.  If the compressor is given an
> input of 1000000 bytes and manages to compress it to 999999 bytes,
> we'll store it compressed, and pay for decompression cycles on every
> access, even though the I/O savings are nonexistent.  That's not sane.

Especially given that uncompressed toasted data is quite a bit more flexible
in that it can handle substr() efficiently.

Thinking about it, if the datum is stored inline then a single byte saved
is at least theoretically helpful. If it's stored in a toast table then
saving anything less than 2k has pretty slim odds of being helpful at all,
even if the percentage gain is pretty big.

I don't know what the right answer is yet, but it looks to me like there
need to be two strategies, one for inline toasted tuples and one for
externally toasted tuples.

Unfortunately that's not the way the toaster is structured. First it goes
through and compresses all the fields starting with the largest and then it
starts pushing out to external storage all the fields starting with the
largest remaining. It doesn't really know whether something's going to be
stored externally when it's compressing.

It seems to me that having a fairly high minimum percentage of 25% would get
pretty close to the intended behaviour. Small data which happens to be highly
compressible would only have to save 8-32 bytes to be compressed. Data over 8k
would have to save at least 2k to be compressed.

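(To spell out the arithmetic: at a 25% minimum rate, a datum of 32-128
bytes has to save 8-32 bytes, and a datum over 8k has to save more
than 2k.)
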
(Incidentally, this means what I said earlier about uselessly trying to
compress objects below 256 is even grosser than I realized. If you have a
single large object which even after compressing will be over the toast target
it will force *every* varlena to be considered for compression even though
they mostly can't be compressed. Considering a varlena smaller than 256 for
compression only costs a useless palloc, so it's not the end of the world but
still. It does seem kind of strange that a tuple which otherwise wouldn't be
toasted at all suddenly gets all its fields compressed if you add one more
field which ends up being stored externally.)

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com



Re: pg_lzcompress strategy parameters

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> (Incidentally, this means what I said earlier about uselessly trying to
> compress objects below 256 is even grosser than I realized. If you have a
> single large object which even after compressing will be over the toast target
> it will force *every* varlena to be considered for compression even though
> they mostly can't be compressed. Considering a varlena smaller than 256 for
> compression only costs a useless palloc, so it's not the end of the world but
> still. It does seem kind of strange that a tuple which otherwise wouldn't be
> toasted at all suddenly gets all its fields compressed if you add one more
> field which ends up being stored externally.)

Yeah.  It seems like we should modify the first and third loops so that
if (after compression if any) the largest attribute is *by itself*
larger than the target threshold, then we push it out to the toast table
immediately, rather than continuing to compress other fields that might
well not need to be touched.
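
A toy model of that control flow might look like the following (this is
deliberately not the real tuptoaster code; the sizes, the fake
compressor, and the target constant are all invented just to show the
ordering):

#include <stdio.h>
#include <stdbool.h>

#define TOAST_TARGET 2000               /* stand-in for the real target */
#define NATTRS       4

/* Pretend-compress: knock 10% off anything over 256 bytes. */
static int
toy_compress(int sz)
{
    return (sz > 256) ? sz - sz / 10 : sz;
}

int
main(void)
{
    int         size[NATTRS] = {9000, 400, 150, 80};
    bool        external[NATTRS] = {false};
    bool        compressed[NATTRS] = {false};

    for (;;)
    {
        int         total = 0;
        int         biggest = -1;

        for (int i = 0; i < NATTRS; i++)
        {
            if (external[i])
                continue;
            total += size[i];
            if (!compressed[i] && (biggest < 0 || size[i] > size[biggest]))
                biggest = i;
        }
        if (total <= TOAST_TARGET || biggest < 0)
            break;                      /* fits, or nothing left to try */

        size[biggest] = toy_compress(size[biggest]);
        compressed[biggest] = true;

        /*
         * The suggested change: if even after compression the largest
         * attribute is bigger than the target all by itself, push it
         * out immediately instead of compressing the remaining small
         * attributes first.
         */
        if (size[biggest] > TOAST_TARGET)
            external[biggest] = true;
    }

    for (int i = 0; i < NATTRS; i++)
        printf("attr %d: %d bytes %s\n", i, size[i],
               external[i] ? "external" : "inline");
    return 0;
}

In the unmodified logic, the big attribute would stay inline until all
the small attributes had been considered for compression too, which is
exactly the behaviour complained about above.
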
        regards, tom lane


Re: pg_lzcompress strategy parameters

From
Jan Wieck
Date:
On 8/5/2007 6:30 PM, Tom Lane wrote:
> Gregory Stark <stark@enterprisedb.com> writes:
>> (Incidentally, this means what I said earlier about uselessly trying to
>> compress objects below 256 is even grosser than I realized. If you have a
>> single large object which even after compressing will be over the toast target
>> it will force *every* varlena to be considered for compression even though
>> they mostly can't be compressed. Considering a varlena smaller than 256 for
>> compression only costs a useless palloc, so it's not the end of the world but
>> still. It does seem kind of strange that a tuple which otherwise wouldn't be
>> toasted at all suddenly gets all its fields compressed if you add one more
>> field which ends up being stored externally.)
> 
> Yeah.  It seems like we should modify the first and third loops so that
> if (after compression if any) the largest attribute is *by itself*
> larger than the target threshold, then we push it out to the toast table
> immediately, rather than continuing to compress other fields that might
> well not need to be touched.

I agree with the general lack of sanity in the logic and think this one 
is a good starter.

Another optimization to think about would eventually be to let the
compressor abort the attempt after the first X bytes had to be copied
literally. People do have the possibility to disable compression on a
per-column basis, but how many actually do so? And if the first 100,000
bytes of a 10M attribute can't be compressed, it is very likely that the
input is compressed already.

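In rough shape, that check could look like the toy below (the match
test, the 4096-byte threshold, and the function name are all invented
here; the real pg_lzcompress loop is of course more involved):

#include <stdbool.h>
#include <stddef.h>

#define ABORT_AFTER_LITERALS 4096       /* the "first X bytes" knob */

bool
try_compress(const unsigned char *src, size_t len)
{
    size_t      literals = 0;

    for (size_t i = 0; i < len; i++)
    {
        /*
         * Stand-in for the real match search: pretend a byte only
         * compresses when it repeats its predecessor.
         */
        bool        compresses = (i > 0 && src[i] == src[i - 1]);

        if (!compresses && ++literals >= ABORT_AFTER_LITERALS)
            return false;   /* that many literal bytes already: give up */
    }
    return true;            /* keep the (pretend) compressed result */
}
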

Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #