Thread: TOAST compression
toast_compress_datum() considers compression to be "successful" if the
compressed version of the datum is smaller than the uncompressed version. I
think this is overly generous: if compression reduces the size of the datum
by, say, 0.01%, it is likely a net loss to use the compressed version of the
datum since we'll need to pay for LZ decompression every time that we
de-TOAST it. This situation can occur frequently when storing
"mostly-uncompressible" data (compressed images, encrypted data, etc.) --
some parts of the data will compress well (e.g. metadata), but the vast
majority will not.

It's true that LZ decompression is fast, so we should probably use the
compressed version of the datum unless the reduction in size is very small.
I'm not sure precisely what that threshold should be, however.

Comments?

-Neil
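[For reference, one rough way to see how much the compressed representation
actually saves for a given column -- a sketch only, assuming a hypothetical
table t with a text or bytea column named payload, and the pg_column_size()
function available in 8.1 and later -- is to compare the stored size against
the uncompressed size:

    -- stored_bytes is the on-disk (possibly compressed) size
    SELECT octet_length(payload)   AS uncompressed_bytes,
           pg_column_size(payload) AS stored_bytes,
           round(100.0 * (octet_length(payload) - pg_column_size(payload))
                 / octet_length(payload), 2) AS savings_pct
    FROM t;

A savings_pct of a small fraction of a percent is the kind of "successful"
compression being questioned above.]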
Neil Conway wrote:
> toast_compress_datum() considers compression to be "successful" if the
> compressed version of the datum is smaller than the uncompressed
> version. I think this is overly generous: if compression reduces the
> size of the datum by, say, 0.01%, it is likely a net loss to use the
> compressed version of the datum since we'll need to pay for LZ
> decompression every time that we de-TOAST it. This situation can occur
> frequently when storing "mostly-uncompressible" data (compressed images,
> encrypted data, etc.) -- some parts of the data will compress well (e.g.
> metadata), but the vast majority will not.
>
> It's true that LZ decompression is fast, so we should probably use the
> compressed version of the datum unless the reduction in size is very
> small. I'm not sure precisely what that threshold should be, however.
>
> Comments?

20%? 25%

--
  Bruce Momjian   http://candle.pha.pa.us
  SRA OSS, Inc.   http://www.sraoss.com

  + If your life is a hard drive, Christ can be your backup. +
On Sat, Feb 25, 2006 at 09:39:34PM -0500, Neil Conway wrote:
> It's true that LZ decompression is fast, so we should probably use the
> compressed version of the datum unless the reduction in size is very
> small. I'm not sure precisely what that threshold should be, however.

Any idea on how decompression time compares to IO bandwidth? In other words,
how long does it take to decompress 1MB vs read that 1MB vs read whatever
the uncompressed size is?

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
Neil Conway <neilc@samurai.com> writes:
> toast_compress_datum() considers compression to be "successful" if the
> compressed version of the datum is smaller than the uncompressed
> version. I think this is overly generous: if compression reduces the
> size of the datum by, say, 0.01%, it is likely a net loss to use the
> compressed version of the datum since we'll need to pay for LZ
> decompression every time that we de-TOAST it. This situation can occur
> frequently when storing "mostly-uncompressible" data (compressed images,
> encrypted data, etc.) -- some parts of the data will compress well (e.g.
> metadata), but the vast majority will not.

Does it really occur frequently? When dealing with already-compressed or
encrypted data, the LZ transform actually makes the data larger by some
small percentage. This will outweigh any savings on compressible headers or
what have you, just because those are only a tiny part of the file to begin
with. (Else the format designers would have found a way to compress them
too.) So I'd expect the existing test to catch most of the real-world cases
you cite.

I'm not particularly inclined to worry about this without some hard evidence
that it's a problem. You'd need some numerical evidence anyway to justify
any specific threshold, else it's just as arbitrary as "is it smaller" ...
and the latter at least requires a few instructions less to check.

			regards, tom lane
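[A quick way to check this expectation empirically -- only a sketch, and it
assumes the pgcrypto module is installed for gen_random_bytes(), which
returns at most 1024 bytes per call, hence the concatenation -- is to store
some incompressible data and compare sizes:

    CREATE TABLE lz_test (img bytea);  -- hypothetical test table
    INSERT INTO lz_test
    SELECT gen_random_bytes(1024) || gen_random_bytes(1024) ||
           gen_random_bytes(1024)
    FROM generate_series(1, 100);

    -- If the existing "is it smaller" test already rejects incompressible
    -- input, stored_bytes should come out essentially equal to raw_bytes.
    SELECT avg(octet_length(img))   AS raw_bytes,
           avg(pg_column_size(img)) AS stored_bytes
    FROM lz_test;
]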
Jim,

On 2/26/06 8:00 AM, "Jim C. Nasby" <jnasby@pervasive.com> wrote:

> Any idea on how decompression time compares to IO bandwidth? In other
> words, how long does it take to decompress 1MB vs read that 1MB vs read
> whatever the uncompressed size is?

On DBT-3 data, I've just run some tests meant to simulate the speed
differences of compression versus native I/O. My thought is that an external
use of gzip on a binary dump file should be close to the speed of LZW on
toasted fields, so I just dumped the "supplier" table (see below) of size
202MB in data pages to disk, then ran gzip/gunzip on the binary file. Second
test - an 8k block dd from that same file, meant to simulate a seq scan
(it's faster by 25% than doing it in PG though):

==================== gzip/gunzip =====================
[mppdemo1@salerno0]$ ls -l supplier.bin
-rw-r--r--  1 mppdemo1 mppdemo1 177494266 Feb 26 09:17 supplier.bin

[mppdemo1@salerno0]$ time gzip supplier.bin

real    0m12.979s
user    0m12.558s
sys     0m0.400s

[mppdemo1@salerno0]$ time gunzip supplier.bin

real    0m2.286s
user    0m1.713s
sys     0m0.573s

[mppdemo1@salerno0]$ time dd if=supplier.bin of=/dev/null bs=8k
21666+1 records in
21666+1 records out

real    0m0.138s
user    0m0.003s
sys     0m0.135s

==================== Interpretation =====================
Zip speed:           177.5 MB in 13 seconds, or 13.6 MB/s
Unzip speed:         177.5 MB in 2.29 seconds, or 77.5 MB/s
Direct access speed: 177.5 MB in 0.138 seconds, or 1,286 MB/s

Note that this filesystem can do about 400MB/s, and we routinely see scan
rates of 300MB/s within PG, so the real comparison is:

Direct seqscan at 300MB/s versus gunzip at 77.5MB/s

==================== Background data =====================
demo=# \d supplier
         Table "public.supplier"
   Column    |          Type          | Modifiers
-------------+------------------------+-----------
 s_suppkey   | integer                | not null
 s_name      | character(25)          | not null
 s_address   | character varying(40)  | not null
 s_nationkey | integer                | not null
 s_phone     | character(15)          | not null
 s_acctbal   | numeric(15,2)          | not null
 s_comment   | character varying(101) | not null

demo=# select relname, 8*relpages/128 as MB from pg_class order by relpages desc limit 6;
 relname  |   mb
----------+--------
 lineitem | 123434
 orders   |  24907
 partsupp |  14785
 part     |   3997
 customer |   3293
 supplier |    202
(6 rows)

Time: 2.024 ms

demo=# copy supplier to '/tmp/supplier.bin' with binary;
COPY
Time: 7328.186 ms
demo=# copy supplier to '/tmp/supplier.txt';
COPY
Time: 5503.168 ms

******** Note how the text file dumps faster than binary, and it's smaller
******** at 148MB.

demo=# select version();
                                   version
------------------------------------------------------------------------------
 PostgreSQL 8.1.3 (Bizgres MPP 2.1) on x86_64-unknown-linux-gnu, compiled by
 GCC gcc (GCC) 3.4.4 20050721 (Red Hat 3.4.4-2) compiled on Feb 23 2006
 11:34:06
(1 row)
On Sun, Feb 26, 2006 at 09:31:05AM -0800, Luke Lonergan wrote:
> Note that this filesystem can do about 400MB/s, and we routinely see scan
> rates of 300MB/s within PG, so the real comparison is:
>
> Direct seqscan at 300MB/s versus gunzip at 77.5MB/s

So the cutover point (on your system with very fast IO) is 4:1 compression
(is that 20 or 25%?). But that's assuming that PostgreSQL can read data as
fast as dd, which we all know isn't the case. That's also assuming a pretty
top-notch IO subsystem. Based on that, I'd argue that 10% is probably a
better setting, though it would be good to test an actual case (does dbt3
produce fields large enough to ensure that most of them will be toasted?)

Given the variables involved, maybe it makes sense to add a GUC?

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
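[Purely as an illustration of the suggestion above -- no such parameter
exists; the name here is invented -- a GUC-based knob might be used like
this:

    -- hypothetical setting: require at least 10% savings before keeping
    -- the compressed version of a toasted datum
    SET toast_compression_threshold = 0.10;
]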
Jim,

On 2/26/06 10:37 AM, "Jim C. Nasby" <jnasby@pervasive.com> wrote:

> So the cutover point (on your system with very fast IO) is 4:1
> compression (is that 20 or 25%?).

Actually the size of the gzipped binary file on disk was 65MB, compared to
177.5MB uncompressed, so the compression ratio is 37% (?), or 2.73:1.

> But that's assuming that PostgreSQL
> can read data as fast as dd, which we all know isn't the case.

Actually, I had factored that in already. The filesystem delivered 1,200MB/s
out of cache in this case - the 300MB/s is what we routinely do from
Postgres seqscan per instance on this system.

> That's
> also assuming a pretty top-notch IO subsystem.

True - actually I'm pretty impressed with 75MB/s gunzip speed.

> Based on that, I'd argue
> that 10% is probably a better setting, though it would be good to test
> an actual case (does dbt3 produce fields large enough to ensure that
> most of them will be toasted?)

No, unfortunately not. O'Reilly's jobs data have 65K rows, so that would
work. How do we implement LZW compression on toasted fields? I've never
done it!

> Given the variables involved, maybe it makes sense to add a GUC?

Dunno - I'm not sure how the current scheme works; this is new to me. We had
considered using a compression mechanism for table data, but had some
completely different ideas, more along the lines of a compressing heap
store. The main problem as I see it is the CPU required to get there at
reasonable performance, as you point out. However, the trend is inevitable -
we'll soon have more CPU than we could otherwise use to work with...

- Luke
Luke Lonergan wrote:
> Jim,
>
> On 2/26/06 10:37 AM, "Jim C. Nasby" <jnasby@pervasive.com> wrote:
>
> > So the cutover point (on your system with very fast IO) is 4:1
> > compression (is that 20 or 25%?).
>
> Actually the size of the gzipped binary file on disk was 65MB, compared to
> 177.5MB uncompressed, so the compression ratio is 37% (?), or 2.73:1.

I doubt our algorithm would give the same compression (though I haven't
really measured it). The LZ implementation we use is supposed to have
lightning speed at the cost of a not-so-good compression ratio.

> No, unfortunately not. O'Reilly's jobs data have 65K rows, so that would
> work. How do we implement LZW compression on toasted fields? I've never
> done it!

See src/backend/utils/adt/pg_lzcompress.c

--
Alvaro Herrera                 http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
On Sun, 2006-02-26 at 09:31, Luke Lonergan wrote:
> Jim,
>
> On 2/26/06 8:00 AM, "Jim C. Nasby" <jnasby@pervasive.com> wrote:
>
> > Any idea on how decompression time compares to IO bandwidth? In other
> > words, how long does it take to decompress 1MB vs read that 1MB vs read
> > whatever the uncompressed size is?
>
> On DBT-3 data, I've just run some tests meant to simulate the speed
> differences of compression versus native I/O. My thought is that an
> external use of gzip on a binary dump file should be close to the speed of
> LZW on toasted fields,

Your basic assumption is probably wrong :(

gzip what? The "compression level" setting of gzip has a big effect on both
compression speed and compression rate. And I suspect that even the fastest
level (gzip -1) compresses slower and better than postgresql's lzcompress.

> so I just dumped the "supplier" table (see below) of
> size 202MB in data pages to disk, then ran gzip/gunzip on the binary
> file. Second test - an 8k block dd from that same file, meant to simulate a
> seq scan (it's faster by 25% than doing it in PG though):
>
> ==================== gzip/gunzip =====================
> [mppdemo1@salerno0]$ ls -l supplier.bin
> -rw-r--r--  1 mppdemo1 mppdemo1 177494266 Feb 26 09:17 supplier.bin
>
> [mppdemo1@salerno0]$ time gzip supplier.bin
>
> real    0m12.979s
> user    0m12.558s
> sys     0m0.400s
>
> [mppdemo1@salerno0]$ time gunzip supplier.bin
>
> real    0m2.286s
> user    0m1.713s
> sys     0m0.573s

These are also somewhat bogus tests; if you would want them to be comparable
with dd below, you should have used 'time gzip -c supplier.bin > /dev/null'

> [mppdemo1@salerno0]$ time dd if=supplier.bin of=/dev/null bs=8k
> 21666+1 records in
> 21666+1 records out
>
> real    0m0.138s
> user    0m0.003s
> sys     0m0.135s

----------------
Hannu
Hannu,

On 2/26/06 12:19 PM, "Hannu Krosing" <hannu@skype.net> wrote:

> > On DBT-3 data, I've just run some tests meant to simulate the speed
> > differences of compression versus native I/O. My thought is that an
> > external use of gzip on a binary dump file should be close to the speed of
> > LZW on toasted fields,
>
> Your basic assumption is probably wrong :(
>
> gzip what? The "compression level" setting of gzip has a big effect on both
> compression speed and compression rate. And I suspect that even the fastest
> level (gzip -1) compresses slower and better than postgresql's lzcompress.

I thought it might form a lower bound on speed. 75MB/s on an Opteron is
really super fast compared to what I expected from LZ. And since gzip uses
LZ compression, maybe they'll be in the same family of results,
behavior-wise.

I'd be more concerned about the implementation within the executor and how
the "get tuple, decompress tuple, use tuple" cycling might destroy the
efficiency of the scanning compared to decompressing large blocks of data.

As Jim pointed out, we would need a real test to confirm the behavior. I'm
not yet acquainted with the toast compression, so it's harder for me to
compose a real test.

> These are also somewhat bogus tests; if you would want them to be
> comparable with dd below, you should have used 'time gzip -c
> supplier.bin > /dev/null'

Hmm - the file writing seems to matter on decompress, making decompression
even faster:

[root@salerno0 tmp]# time gunzip tmp.bin.gz

real    0m2.254s
user    0m1.718s
sys     0m0.536s
[root@salerno0 tmp]# time gzip tmp.bin

real    0m12.912s
user    0m12.555s
sys     0m0.355s
[root@salerno0 tmp]# time gzip -c supplier.bin > /dev/null

real    0m12.582s
user    0m12.464s
sys     0m0.115s
[root@salerno0 tmp]# time gunzip -c tmp.bin.gz > /dev/null

real    0m1.734s
user    0m1.690s
sys     0m0.043s

- Luke
Luke,

> As Jim pointed out, we would need a real test to confirm the behavior.
> I'm not yet acquainted with the toast compression, so it's harder for me
> to compose a real test.

Check out SET STORAGE.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
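[For reference, the per-column storage modes Josh is pointing at -- a sketch
borrowing the table and column names from Luke's tests below, which are
assumptions here:

    -- EXTENDED is the default: try LZ compression, then move out of line
    ALTER TABLE bigtable1 ALTER COLUMN b SET STORAGE EXTENDED;
    -- EXTERNAL: store out of line but never compress
    ALTER TABLE bigtable1 ALTER COLUMN b SET STORAGE EXTERNAL;
    -- PLAIN: store inline, uncompressed
    ALTER TABLE bigtable1 ALTER COLUMN b SET STORAGE PLAIN;

The setting only affects newly stored values, so existing rows have to be
reloaded to see the effect.]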
Josh,

On 2/26/06 8:04 PM, "Josh Berkus" <josh@agliodbs.com> wrote:

> Check out SET STORAGE.

I just altered the MIVP data generator in Bizgres MPP to produce the usual
15 column table but with a 6K row size. You'd only expect a few tens of
bytes variance around the 6K, and the data is randomly chosen words from a
seed file of about 14.5K. I generated a file with 100,000 rows, total size
is 600MB. My tests are with Bizgres MPP 2.1 GA (based on Postgres 8.1.3) on
4 machines with 8 primary segments.

So, with a column storage type of PLAIN, here's the filesystem resident
size:

mpptestdb=# select relname, 8*relpages/128 as MB from pg_class where relname='bigtable1';
  relname  | mb
-----------+-----
 bigtable1 | 789

And here are a couple of selects (obviously coming from the 64GB of RAM):

mpptestdb=# select count(*) from bigtable1;
 count
--------
 100000
(1 row)

Time: 45.685 ms
mpptestdb=# select count(*) from bigtable1;
 count
--------
 100000
(1 row)

Time: 25.024 ms

Here it is with the default EXTENDED:

mpptestdb=# select relname, 8*relpages/128 as MB from pg_class where relname='bigtable1';
  relname  | mb
-----------+----
 bigtable1 | 20

Now I'm stuck - I can get the toast table OID, but my attempts to find out
its size are futile. Suggestions?

- Luke
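[One way to get the TOAST table's size directly from the catalogs in 8.1 --
a sketch following the same relpages arithmetic Luke uses above; it omits
the toast index:

    SELECT t.relname, 8 * t.relpages / 128 AS mb
    FROM pg_class c
    JOIN pg_class t ON t.oid = c.reltoastrelid
    WHERE c.relname = 'bigtable1';
]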
While I'm waiting to figure out how to get the size of the toast table, at
least I can provide the speed of query with/without assumed compression on
the 6K text columns.

To ensure that we're actually accessing the data in the rows, I do a regexp
query on the TOASTed rows:

mpptestdb=# select count(*) from bigtable1 where b ~ 'a';
 count
--------
 100000
(1 row)

Time: 700.574 ms

That's 770MB of database data regexp'ed in 0.7 seconds, or about
1.1GB/second. Not bad, but slower by good measure than the I/O subsystem.

Now we try the same table with PLAIN storage type:

mpptestdb=# select count(*) from bigtable1 where b ~ 'a';
 count
--------
 100000
(1 row)

Time: 284.146 ms

That's the same 770MB in 0.28 seconds or 2.8GB/s, much better.

- Luke
Luke,

> While I'm waiting to figure out how to get the size of the toast table, at
> least I can provide the speed of query with/without assumed compression on
> the 6K text columns.

Check out the table_size view in the newsysviews project. Andrew computed
the regular, toast, and index sizes as a query.

--
Josh Berkus
Aglio Database Solutions
San Francisco
"Josh Berkus" <josh@agliodbs.com> wrote > > While I'm waiting to figure out how to get the size of the toast table, at > > least I can provide the speed of query with/without assumed compression on > > the 6K text columns. > > Check out the table_size view in the newsysviews project. Andrew computed the > regular, toast, and index sizes as a query. > Will pg_total_relation_size_oid()/pg_total_relation_size_name() do the job? Regards, Qingqing