Thread: TOAST compression
toast_compress_datum() considers compression to be "successful" if the
compressed version of the datum is smaller than the uncompressed version. I
think this is overly generous: if compression reduces the size of the datum
by, say, 0.01%, it is likely a net loss to use the compressed version of the
datum since we'll need to pay for LZ decompression every time that we
de-TOAST it. This situation can occur frequently when storing
"mostly-uncompressible" data (compressed images, encrypted data, etc.) --
some parts of the data will compress well (e.g. metadata), but the vast
majority will not.

It's true that LZ decompression is fast, so we should probably use the
compressed version of the datum unless the reduction in size is very small.
I'm not sure precisely what that threshold should be, however.

Comments?

-Neil
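[For reference, one rough way to see how much the compressed representation
actually saves for a given column -- a sketch only, assuming a hypothetical
table t with a text or bytea column named payload, and the pg_column_size()
function available in 8.1 and later -- is to compare the stored size against
the uncompressed size:

    -- stored_bytes is the on-disk (possibly compressed) size
    SELECT octet_length(payload)   AS uncompressed_bytes,
           pg_column_size(payload) AS stored_bytes,
           round(100.0 * (octet_length(payload) - pg_column_size(payload))
                 / octet_length(payload), 2) AS savings_pct
    FROM t;

A savings_pct of a small fraction of a percent is the kind of "successful"
compression being questioned above.]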
Neil Conway wrote:
> toast_compress_datum() considers compression to be "successful" if the
> compressed version of the datum is smaller than the uncompressed
> version. I think this is overly generous: if compression reduces the
> size of the datum by, say, 0.01%, it is likely a net loss to use the
> compressed version of the datum since we'll need to pay for LZ
> decompression every time that we de-TOAST it. This situation can occur
> frequently when storing "mostly-uncompressible" data (compressed images,
> encrypted data, etc.) -- some parts of the data will compress well (e.g.
> metadata), but the vast majority will not.
>
> It's true that LZ decompression is fast, so we should probably use the
> compressed version of the datum unless the reduction in size is very
> small. I'm not sure precisely what that threshold should be, however.
>
> Comments?

20%? 25%

--
  Bruce Momjian   http://candle.pha.pa.us
  SRA OSS, Inc.   http://www.sraoss.com

  + If your life is a hard drive, Christ can be your backup. +
On Sat, Feb 25, 2006 at 09:39:34PM -0500, Neil Conway wrote:
> It's true that LZ decompression is fast, so we should probably use the
> compressed version of the datum unless the reduction in size is very
> small. I'm not sure precisely what that threshold should be, however.

Any idea on how decompression time compares to IO bandwidth? In other words,
how long does it take to decompress 1MB vs read that 1MB vs read whatever
the uncompressed size is?

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
Neil Conway <neilc@samurai.com> writes:
> toast_compress_datum() considers compression to be "successful" if the
> compressed version of the datum is smaller than the uncompressed
> version. I think this is overly generous: if compression reduces the
> size of the datum by, say, 0.01%, it is likely a net loss to use the
> compressed version of the datum since we'll need to pay for LZ
> decompression every time that we de-TOAST it. This situation can occur
> frequently when storing "mostly-uncompressible" data (compressed images,
> encrypted data, etc.) -- some parts of the data will compress well (e.g.
> metadata), but the vast majority will not.

Does it really occur frequently? When dealing with already-compressed or
encrypted data, the LZ transform actually makes the data larger by some
small percentage. This will outweigh any savings on compressible headers or
what have you, just because those are only a tiny part of the file to begin
with. (Else the format designers would have found a way to compress them
too.) So I'd expect the existing test to catch most of the real-world cases
you cite.

I'm not particularly inclined to worry about this without some hard evidence
that it's a problem. You'd need some numerical evidence anyway to justify
any specific threshold, else it's just as arbitrary as "is it smaller" ...
and the latter at least requires a few instructions less to check.

			regards, tom lane
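[A quick way to check this expectation empirically -- only a sketch, and it
assumes the pgcrypto module is installed for gen_random_bytes(), which
returns at most 1024 bytes per call, hence the concatenation -- is to store
some incompressible data and compare sizes:

    CREATE TABLE lz_test (img bytea);  -- hypothetical test table
    INSERT INTO lz_test
    SELECT gen_random_bytes(1024) || gen_random_bytes(1024) ||
           gen_random_bytes(1024)
    FROM generate_series(1, 100);

    -- If the existing "is it smaller" test already rejects incompressible
    -- input, stored_bytes should come out essentially equal to raw_bytes.
    SELECT avg(octet_length(img))   AS raw_bytes,
           avg(pg_column_size(img)) AS stored_bytes
    FROM lz_test;
]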
Jim,

On 2/26/06 8:00 AM, "Jim C. Nasby" <jnasby@pervasive.com> wrote:

> Any idea on how decompression time compares to IO bandwidth? In other
> words, how long does it take to decompress 1MB vs read that 1MB vs read
> whatever the uncompressed size is?

On DBT-3 data, I've just run some tests meant to simulate the speed
differences of compression versus native I/O. My thought is that an external
use of gzip on a binary dump file should be close to the speed of LZW on
toasted fields, so I just dumped the "supplier" table (see below) of size
202MB in data pages to disk, then ran gzip/gunzip on the binary file. Second
test - an 8k block dd from that same file, meant to simulate a seq scan
(it's faster by 25% than doing it in PG though):

==================== gzip/gunzip =====================
[mppdemo1@salerno0]$ ls -l supplier.bin
-rw-r--r--  1 mppdemo1 mppdemo1 177494266 Feb 26 09:17 supplier.bin

[mppdemo1@salerno0]$ time gzip supplier.bin

real    0m12.979s
user    0m12.558s
sys     0m0.400s

[mppdemo1@salerno0]$ time gunzip supplier.bin

real    0m2.286s
user    0m1.713s
sys     0m0.573s

[mppdemo1@salerno0]$ time dd if=supplier.bin of=/dev/null bs=8k
21666+1 records in
21666+1 records out

real    0m0.138s
user    0m0.003s
sys     0m0.135s

==================== Interpretation =====================
Zip speed:           177.5 MB in 13 seconds, or 13.6 MB/s
Unzip speed:         177.5 MB in 2.29 seconds, or 77.5 MB/s
Direct access speed: 177.5 MB in 0.138 seconds, or 1,286 MB/s

Note that this filesystem can do about 400MB/s, and we routinely see scan
rates of 300MB/s within PG, so the real comparison is:

Direct seqscan at 300MB/s versus gunzip at 77.5MB/s

==================== Background data =====================
demo=# \d supplier
         Table "public.supplier"
   Column    |          Type          | Modifiers
-------------+------------------------+-----------
 s_suppkey   | integer                | not null
 s_name      | character(25)          | not null
 s_address   | character varying(40)  | not null
 s_nationkey | integer                | not null
 s_phone     | character(15)          | not null
 s_acctbal   | numeric(15,2)          | not null
 s_comment   | character varying(101) | not null

demo=# select relname, 8*relpages/128 as MB from pg_class order by relpages desc limit 6;
 relname  |   mb
----------+--------
 lineitem | 123434
 orders   |  24907
 partsupp |  14785
 part     |   3997
 customer |   3293
 supplier |    202
(6 rows)

Time: 2.024 ms

demo=# copy supplier to '/tmp/supplier.bin' with binary;
COPY
Time: 7328.186 ms
demo=# copy supplier to '/tmp/supplier.txt';
COPY
Time: 5503.168 ms

******** Note how the text file dumps faster than binary, and it's smaller
******** at 148MB.

demo=# select version();
                                   version
------------------------------------------------------------------------------
 PostgreSQL 8.1.3 (Bizgres MPP 2.1) on x86_64-unknown-linux-gnu, compiled by
 GCC gcc (GCC) 3.4.4 20050721 (Red Hat 3.4.4-2) compiled on Feb 23 2006
 11:34:06
(1 row)
On Sun, Feb 26, 2006 at 09:31:05AM -0800, Luke Lonergan wrote:
> Note that this filesystem can do about 400MB/s, and we routinely see scan
> rates of 300MB/s within PG, so the real comparison is:
>
> Direct seqscan at 300MB/s versus gunzip at 77.5MB/s

So the cutover point (on your system with very fast IO) is 4:1 compression
(is that 20 or 25%?). But that's assuming that PostgreSQL can read data as
fast as dd, which we all know isn't the case. That's also assuming a pretty
top-notch IO subsystem. Based on that, I'd argue that 10% is probably a
better setting, though it would be good to test an actual case (does dbt3
produce fields large enough to ensure that most of them will be toasted?)

Given the variables involved, maybe it makes sense to add a GUC?

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
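[Purely as an illustration of the suggestion above -- no such parameter
exists; the name here is invented -- a GUC-based knob might be used like
this:

    -- hypothetical setting: require at least 10% savings before keeping
    -- the compressed version of a toasted datum
    SET toast_compression_threshold = 0.10;
]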
Jim,

On 2/26/06 10:37 AM, "Jim C. Nasby" <jnasby@pervasive.com> wrote:

> So the cutover point (on your system with very fast IO) is 4:1
> compression (is that 20 or 25%?).

Actually the size of the gzipped binary file on disk was 65MB, compared to
177.5MB uncompressed, so the compression ratio is 37% (?), or 2.73:1.

> But that's assuming that PostgreSQL
> can read data as fast as dd, which we all know isn't the case.

Actually, I had factored that in already. The filesystem delivered 1,200MB/s
out of cache in this case - the 300MB/s is what we routinely do from
Postgres seqscan per instance on this system.

> That's
> also assuming a pretty top-notch IO subsystem.

True - actually I'm pretty impressed with 75MB/s gunzip speed.

> Based on that, I'd argue
> that 10% is probably a better setting, though it would be good to test
> an actual case (does dbt3 produce fields large enough to ensure that
> most of them will be toasted?)

No, unfortunately not. O'Reilly's jobs data have 65K rows, so that would
work. How do we implement LZW compression on toasted fields? I've never
done it!

> Given the variables involved, maybe it makes sense to add a GUC?

Dunno - I'm not sure how the current scheme works; this is new to me. We had
considered using a compression mechanism for table data, but had some
completely different ideas, more along the lines of a compressing heap
store. The main problem as I see it is the CPU required to get there at
reasonable performance, as you point out. However, the trend is inevitable -
we'll soon have more CPU than we could otherwise use to work with...

- Luke
Luke Lonergan wrote:
> Jim,
>
> On 2/26/06 10:37 AM, "Jim C. Nasby" <jnasby@pervasive.com> wrote:
>
> > So the cutover point (on your system with very fast IO) is 4:1
> > compression (is that 20 or 25%?).
>
> Actually the size of the gzipped binary file on disk was 65MB, compared to
> 177.5MB uncompressed, so the compression ratio is 37% (?), or 2.73:1.

I doubt our algorithm would give the same compression (though I haven't
really measured it). The LZ implementation we use is supposed to have
lightning speed at the cost of a not-so-good compression ratio.

> No, unfortunately not. O'Reilly's jobs data have 65K rows, so that would
> work. How do we implement LZW compression on toasted fields? I've never
> done it!

See src/backend/utils/adt/pg_lzcompress.c

--
Alvaro Herrera                 http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
On Sun, 2006-02-26 at 09:31, Luke Lonergan wrote:
> Jim,
>
> On 2/26/06 8:00 AM, "Jim C. Nasby" <jnasby@pervasive.com> wrote:
>
> > Any idea on how decompression time compares to IO bandwidth? In other
> > words, how long does it take to decompress 1MB vs read that 1MB vs read
> > whatever the uncompressed size is?
>
> On DBT-3 data, I've just run some tests meant to simulate the speed
> differences of compression versus native I/O. My thought is that an
> external use of gzip on a binary dump file should be close to the speed of
> LZW on toasted fields,

Your basic assumption is probably wrong :(

gzip what? The "compression level" setting of gzip has a big effect on both
compression speed and compression rate. And I suspect that even the fastest
level (gzip -1) compresses slower and better than postgresql's lzcompress.

> so I just dumped the "supplier" table (see below) of
> size 202MB in data pages to disk, then ran gzip/gunzip on the binary
> file. Second test - an 8k block dd from that same file, meant to simulate a
> seq scan (it's faster by 25% than doing it in PG though):
>
> ==================== gzip/gunzip =====================
> [mppdemo1@salerno0]$ ls -l supplier.bin
> -rw-r--r--  1 mppdemo1 mppdemo1 177494266 Feb 26 09:17 supplier.bin
>
> [mppdemo1@salerno0]$ time gzip supplier.bin
>
> real    0m12.979s
> user    0m12.558s
> sys     0m0.400s
>
> [mppdemo1@salerno0]$ time gunzip supplier.bin
>
> real    0m2.286s
> user    0m1.713s
> sys     0m0.573s

These are also somewhat bogus tests; if you would want them to be comparable
with dd below, you should have used 'time gzip -c supplier.bin > /dev/null'

> [mppdemo1@salerno0]$ time dd if=supplier.bin of=/dev/null bs=8k
> 21666+1 records in
> 21666+1 records out
>
> real    0m0.138s
> user    0m0.003s
> sys     0m0.135s

----------------
Hannu
Hannu,

On 2/26/06 12:19 PM, "Hannu Krosing" <hannu@skype.net> wrote:

> > On DBT-3 data, I've just run some tests meant to simulate the speed
> > differences of compression versus native I/O. My thought is that an
> > external use of gzip on a binary dump file should be close to the speed of
> > LZW on toasted fields,
>
> Your basic assumption is probably wrong :(
>
> gzip what? The "compression level" setting of gzip has a big effect on both
> compression speed and compression rate. And I suspect that even the fastest
> level (gzip -1) compresses slower and better than postgresql's lzcompress.

I thought it might form a lower bound on speed. 75MB/s on an Opteron is
really super fast compared to what I expected from LZ. And since gzip uses
LZ compression, maybe they'll be in the same family of results,
behavior-wise.

I'd be more concerned about the implementation within the executor and how
the "get tuple, decompress tuple, use tuple" cycling might destroy the
efficiency of the scanning compared to decompressing large blocks of data.

As Jim pointed out, we would need a real test to confirm the behavior. I'm
not yet acquainted with the toast compression, so it's harder for me to
compose a real test.

> These are also somewhat bogus tests; if you would want them to be
> comparable with dd below, you should have used 'time gzip -c
> supplier.bin > /dev/null'

Hmm - the file writing seems to matter on decompress, making decompression
even faster:

[root@salerno0 tmp]# time gunzip tmp.bin.gz

real    0m2.254s
user    0m1.718s
sys     0m0.536s
[root@salerno0 tmp]# time gzip tmp.bin

real    0m12.912s
user    0m12.555s
sys     0m0.355s
[root@salerno0 tmp]# time gzip -c supplier.bin > /dev/null

real    0m12.582s
user    0m12.464s
sys     0m0.115s
[root@salerno0 tmp]# time gunzip -c tmp.bin.gz > /dev/null

real    0m1.734s
user    0m1.690s
sys     0m0.043s

- Luke
Luke,

> As Jim pointed out, we would need a real test to confirm the behavior.
> I'm not yet acquainted with the toast compression, so it's harder for me
> to compose a real test.

Check out SET STORAGE.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
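[For reference, the per-column storage modes Josh is pointing at -- a sketch
borrowing the table and column names from Luke's tests below, which are
assumptions here:

    -- EXTENDED is the default: try LZ compression, then move out of line
    ALTER TABLE bigtable1 ALTER COLUMN b SET STORAGE EXTENDED;
    -- EXTERNAL: store out of line but never compress
    ALTER TABLE bigtable1 ALTER COLUMN b SET STORAGE EXTERNAL;
    -- PLAIN: store inline, uncompressed
    ALTER TABLE bigtable1 ALTER COLUMN b SET STORAGE PLAIN;

The setting only affects newly stored values, so existing rows have to be
reloaded to see the effect.]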
Josh,

On 2/26/06 8:04 PM, "Josh Berkus" <josh@agliodbs.com> wrote:

> Check out SET STORAGE.

I just altered the MIVP data generator in Bizgres MPP to produce the usual
15 column table but with a 6K row size. You'd only expect a few tens of
bytes variance around the 6K, and the data is randomly chosen words from a
seed file of about 14.5K. I generated a file with 100,000 rows, total size
is 600MB. My tests are with Bizgres MPP 2.1 GA (based on Postgres 8.1.3) on
4 machines with 8 primary segments.

So, with a column storage type of PLAIN, here's the filesystem resident
size:

mpptestdb=# select relname, 8*relpages/128 as MB from pg_class where relname='bigtable1';
  relname  | mb
-----------+-----
 bigtable1 | 789

And here are a couple of selects (obviously coming from the 64GB of RAM):

mpptestdb=# select count(*) from bigtable1;
 count
--------
 100000
(1 row)

Time: 45.685 ms
mpptestdb=# select count(*) from bigtable1;
 count
--------
 100000
(1 row)

Time: 25.024 ms

Here it is with the default EXTENDED:

mpptestdb=# select relname, 8*relpages/128 as MB from pg_class where relname='bigtable1';
  relname  | mb
-----------+----
 bigtable1 | 20

Now I'm stuck - I can get the toast table OID, but my attempts to find out
its size are futile. Suggestions?

- Luke
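[One way to get the TOAST table's size directly from the catalogs in 8.1 --
a sketch following the same relpages arithmetic Luke uses above; it omits
the toast index:

    SELECT t.relname, 8 * t.relpages / 128 AS mb
    FROM pg_class c
    JOIN pg_class t ON t.oid = c.reltoastrelid
    WHERE c.relname = 'bigtable1';
]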
While I'm waiting to figure out how to get the size of the toast table, at
least I can provide the speed of query with/without assumed compression on
the 6K text columns.

To ensure that we're actually accessing the data in the rows, I do a regexp
query on the TOASTed rows:

mpptestdb=# select count(*) from bigtable1 where b ~ 'a';
 count
--------
 100000
(1 row)

Time: 700.574 ms

That's 770MB of database data regexp'ed in 0.7 seconds, or about
1.1GB/second. Not bad, but slower by good measure than the I/O subsystem.

Now we try the same table with PLAIN storage type:

mpptestdb=# select count(*) from bigtable1 where b ~ 'a';
 count
--------
 100000
(1 row)

Time: 284.146 ms

That's the same 770MB in 0.28 seconds or 2.8GB/s, much better.

- Luke
Luke,

> While I'm waiting to figure out how to get the size of the toast table, at
> least I can provide the speed of query with/without assumed compression on
> the 6K text columns.

Check out the table_size view in the newsysviews project. Andrew computed
the regular, toast, and index sizes as a query.

--
Josh Berkus
Aglio Database Solutions
San Francisco
"Josh Berkus" <josh@agliodbs.com> wrote > > While I'm waiting to figure out how to get the size of the toast table, at > > least I can provide the speed of query with/without assumed compression on > > the 6K text columns. > > Check out the table_size view in the newsysviews project. Andrew computed the > regular, toast, and index sizes as a query. > Will pg_total_relation_size_oid()/pg_total_relation_size_name() do the job? Regards, Qingqing