Thread: pg_lzcompress strategy parameters
Greg complained here http://archives.postgresql.org/pgsql-patches/2007-07/msg00342.php that the default strategy parameters used by the TOAST compressor might need some adjustment. After thinking about it a little I wonder whether they're not even more broken than that. The present behavior is:

1. Never compress for inputs < min_input_size (256 bytes by default).

2. Compress inputs >= force_input_size (6K by default), as long as compression produces a result at least 1 byte smaller than the input.

3. For inputs between min_input_size and force_input_size, compress only if compression of at least min_comp_rate percent is achieved (20% by default).

This whole structure seems a bit broken, independently of whether the particular parameter values are good. If the compressor is given an input of 1000000 bytes and manages to compress it to 999999 bytes, we'll store it compressed, and pay for decompression cycles on every access, even though the I/O savings are nonexistent. That's not sane.

I'm inclined to think that the concept of force_input_size is wrong. Instead I suggest that we have a min_comp_rate (minimum percentage savings) and a min_savings (minimum absolute savings), and compress if either one is met. For instance, with min_comp_rate = 10% and min_savings = 1MB, then for inputs below 10MB you'd require at least 10% savings to compress them, but for inputs above 10MB you'd require at least 1MB saved to compress.

Or maybe it should just be a min_comp_rate and nothing else. Compressing a 1GB field to 999MB is probably not very sane either.

This is all independent of what the specific parameter settings should be, but I concur with Greg that those could do with a fresh look.

Thoughts?

			regards, tom lane
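A minimal sketch in C of the decision rule proposed above, assuming hypothetical names (CompressStrategy, worth_storing_compressed) rather than the actual pg_lzcompress strategy struct:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical strategy: store compressed if either the percentage saved
 * meets min_comp_rate or the absolute number of bytes saved meets
 * min_savings.  With min_comp_rate = 10 and min_savings = 1MB, inputs
 * below 10MB need 10% savings; larger inputs only need the flat 1MB.
 */
typedef struct
{
    int     min_comp_rate;          /* minimum percentage savings, e.g. 10 */
    int64_t min_savings;            /* minimum absolute savings in bytes */
} CompressStrategy;

static bool
worth_storing_compressed(int64_t raw_size, int64_t compressed_size,
                         const CompressStrategy *strat)
{
    int64_t saved = raw_size - compressed_size;

    if (saved <= 0)
        return false;                       /* never store a larger result */
    if (saved * 100 >= raw_size * strat->min_comp_rate)
        return true;                        /* percentage threshold met */
    return saved >= strat->min_savings;     /* absolute threshold met */
}

int
main(void)
{
    CompressStrategy s = {10, 1024 * 1024};

    /* 1000000 -> 999999: the case above; neither threshold met, so no. */
    printf("%d\n", worth_storing_compressed(1000000, 999999, &s));
    /* 8K -> 7000: roughly 14% saved, percentage threshold met, so yes. */
    printf("%d\n", worth_storing_compressed(8192, 7000, &s));
    /* 100MB -> 98MB: only 2%, but 2MB absolute savings, so yes. */
    printf("%d\n", worth_storing_compressed(100 * 1024 * 1024,
                                            98 * 1024 * 1024, &s));
    return 0;
}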
Tom Lane wrote:
>
> I'm inclined to think that the concept of force_input_size is wrong.
> Instead I suggest that we have a min_comp_rate (minimum percentage
> savings) and a min_savings (minimum absolute savings), and compress
> if either one is met. For instance, with min_comp_rate = 10% and
> min_savings = 1MB, then for inputs below 10MB you'd require at least
> 10% savings to compress them, but for inputs above 10MB you'd require
> at least 1MB saved to compress.

I would agree with the above, and would even suggest the ability to set this as a GUC or per table. I may be willing to pay a very heavy cost if I knew that the data would only be accessed intermittently.

Joshua D. Drake

--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > This whole structure seems a bit broken, independently of whether the > particular parameter values are good. If the compressor is given an > input of 1000000 bytes and manages to compress it to 999999 bytes, > we'll store it compressed, and pay for decompression cycles on every > access, even though the I/O savings are nonexistent. That's not sane. Especially given that uncompressed toasted data is quite a bit more flexible in that it can handle substr() efficiently. Thinking about it, if the datum is stored inline then a single byte saved is at least theoretically helpful. If it's stored in a toast table then anything less than 2k is pretty slim odds to be helpful at all even if the percentage gain is pretty big. I don't know what the right answer is yet but it looks to me like there does need to be two strategies, one for inline toasted tuples and one for externally toasted tuples. Unfortunately that's not the way the toaster is structured. First it goes through and compresses all the fields starting with the largest and then it starts pushing out to external storage all the fields starting with the largest remaining. It doesn't really know whether something's going to be stored externally when it's compressing. It seems to me that having a fairly high minimum percentage of 25% would get pretty close to the intended behaviour. Small data which happens to be highly compressible would only have to save 8-32 bytes to be compressed. Data over 8k would have to save at least 2k or more to be compressed. (Incidentally, this means what I said earlier about uselessly trying to compress objects below 256 is even grosser than I realized. If you have a single large object which even after compressing will be over the toast target it will force *every* varlena to be considered for compression even though they mostly can't be compressed. Considering a varlena smaller than 256 for compression only costs a useless palloc, so it's not the end of the world but still. It does seem kind of strange that a tuple which otherwise wouldn't be toasted at all suddenly gets all its fields compressed if you add one more field which ends up being stored externally.) -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes:
> (Incidentally, this means what I said earlier about uselessly trying to
> compress objects below 256 is even grosser than I realized. If you have a
> single large object which even after compressing will be over the toast
> target it will force *every* varlena to be considered for compression even
> though they mostly can't be compressed. Considering a varlena smaller than
> 256 for compression only costs a useless palloc, so it's not the end of the
> world but still. It does seem kind of strange that a tuple which otherwise
> wouldn't be toasted at all suddenly gets all its fields compressed if you
> add one more field which ends up being stored externally.)

Yeah. It seems like we should modify the first and third loops so that if (after compression if any) the largest attribute is *by itself* larger than the target threshold, then we push it out to the toast table immediately, rather than continuing to compress other fields that might well not need to be touched.

			regards, tom lane
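A simplified, self-contained model in C of that loop change (not the real toast_insert_or_update code; the attribute sizes, TOAST_TARGET value, and fake_compress ratio are invented for illustration):

#include <stdio.h>
#include <stdbool.h>

#define TOAST_TARGET 2000              /* stand-in for the toast tuple target */
#define NATTS        4

static int
fake_compress(int size)
{
    return (size * 8) / 10;            /* pretend everything shrinks to 80% */
}

int
main(void)
{
    int  size[NATTS]     = {6000, 1500, 300, 100};
    bool external[NATTS] = {false, false, false, false};
    int  i;

    /*
     * Old behaviour: compress every attribute, largest first, before
     * deciding what goes external.  Suggested behaviour: as soon as the
     * largest attribute is over the target even after compression, push
     * it out right away and leave the remaining fields untouched.
     */
    for (;;)
    {
        int biggest = -1;

        for (i = 0; i < NATTS; i++)
            if (!external[i] && (biggest < 0 || size[i] > size[biggest]))
                biggest = i;

        if (biggest < 0 || size[biggest] <= TOAST_TARGET)
            break;                     /* nothing left that is too big */

        size[biggest] = fake_compress(size[biggest]);
        if (size[biggest] > TOAST_TARGET)
            external[biggest] = true;  /* still too big: out it goes now */
    }

    for (i = 0; i < NATTS; i++)
        printf("attr %d: %d bytes, %s\n", i, size[i],
               external[i] ? "external" : "inline");
    return 0;
}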
On 8/5/2007 6:30 PM, Tom Lane wrote:
> Gregory Stark <stark@enterprisedb.com> writes:
>> (Incidentally, this means what I said earlier about uselessly trying to
>> compress objects below 256 is even grosser than I realized. If you have a
>> single large object which even after compressing will be over the toast
>> target it will force *every* varlena to be considered for compression even
>> though they mostly can't be compressed. Considering a varlena smaller than
>> 256 for compression only costs a useless palloc, so it's not the end of the
>> world but still. It does seem kind of strange that a tuple which otherwise
>> wouldn't be toasted at all suddenly gets all its fields compressed if you
>> add one more field which ends up being stored externally.)
>
> Yeah. It seems like we should modify the first and third loops so that
> if (after compression if any) the largest attribute is *by itself*
> larger than the target threshold, then we push it out to the toast table
> immediately, rather than continuing to compress other fields that might
> well not need to be touched.

I agree with the general lack of sanity in the logic and think this one is a good starter.

Another optimization to think about would eventually be to let the compressor abort the attempt after the first X bytes had to be copied literally. People do have the possibility to disable compression on a per-column basis, but how many actually do so? And if the first 100,000 bytes of a 10M attribute can't be compressed, it is very likely that the input is compressed already.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
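A rough, standalone approximation in C of that early-abort idea: before committing to a full compression pass, check whether any back-references turn up in the first chunk of input at all. The window size, match length, and function names here are invented; real pg_lzcompress uses a history hash table, but the shape of the decision is the same:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Return true if any 4-byte sequence within the first abort_after bytes
 * repeats inside a small history window -- a crude stand-in for "the
 * compressor found at least one match instead of copying literals". */
static bool
prefix_looks_compressible(const unsigned char *src, size_t srclen,
                          size_t abort_after)
{
    size_t limit = srclen < abort_after ? srclen : abort_after;
    size_t i, j;

    for (i = 4; i + 4 <= limit; i++)
    {
        size_t start = (i > 1024) ? i - 1024 : 0;   /* tiny history window */

        for (j = start; j + 4 <= i; j++)
            if (memcmp(src + j, src + i, 4) == 0)
                return true;                        /* found a back-reference */
    }
    return false;      /* the whole prefix would have been copied literally */
}

int
main(void)
{
    static unsigned char text[200000], noise[200000];
    unsigned int seed = 12345;
    size_t i;

    memset(text, 'a', sizeof(text));                /* highly repetitive */
    for (i = 0; i < sizeof(noise); i++)
    {
        seed = seed * 1103515245u + 12345u;         /* cheap PRNG */
        noise[i] = (unsigned char) (seed >> 16);
    }

    /* Typically prints 1 then 0: compress the text, skip the noise. */
    printf("%d\n", prefix_looks_compressible(text, sizeof(text), 100000));
    printf("%d\n", prefix_looks_compressible(noise, sizeof(noise), 100000));
    return 0;
}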