Re: lztext and compression ratios... - Mailing list pgsql-hackers

From Tom Lane
Subject Re: lztext and compression ratios...
Msg-id 28549.963472077@sss.pgh.pa.us
In response to Re: Re: [SQL] Re: [GENERAL] lztext and compression ratios...  (JanWieck@t-online.de (Jan Wieck))
List pgsql-hackers
JanWieck@t-online.de (Jan Wieck) writes:
>     Some quick numbers though:
>     I  simply  stripped  down pg_lzcompress.c to call compress2()
>     and uncompress() instead of doing  anything  itself  (what  a
>     nice,  small  source  file  :-).

I went at it in a different way: pulled out pg_lzcompress into a
standalone source program that could also call zlib.  These numbers
represent the pure compression or decompression time for memory-to-
memory processing, no other overhead at all.  Each run was iterated
1000 times to make it long enough to time accurately (so you can
read the times as "milliseconds per operation", though they're
really seconds).

A small text file of about 1K:

LZ compressed 990 bytes to 717 bytes in 0.35 u 0.00 sys sec
LZ decompressed 717 bytes to 990 bytes in 0.04 u 0.00 sys sec
zlib(3) compressed 990 bytes to 527 bytes in 0.73 u 0.00 sys sec
zlib uncompressed 527 bytes to 990 bytes in 0.21 u 0.00 sys sec
zlib(6) compressed 990 bytes to 513 bytes in 0.86 u 0.00 sys sec
zlib uncompressed 513 bytes to 990 bytes in 0.21 u 0.00 sys sec
zlib(9) compressed 990 bytes to 513 bytes in 0.86 u 0.00 sys sec
zlib uncompressed 513 bytes to 990 bytes in 0.20 u 0.00 sys sec

A larger text file, a bit under 100K:

LZ compressed 92343 bytes to 54397 bytes in 49.00 u 0.05 sys sec
LZ decompressed 54397 bytes to 92343 bytes in 3.74 u 0.00 sys sec
zlib(3) compressed 92343 bytes to 40524 bytes in 38.48 u 0.02 sys sec
zlib uncompressed 40524 bytes to 92343 bytes in 8.58 u 0.00 sys sec
zlib(6) compressed 92343 bytes to 38002 bytes in 68.04 u 0.03 sys sec
zlib uncompressed 38002 bytes to 92343 bytes in 8.13 u 0.00 sys sec
zlib(9) compressed 92343 bytes to 37961 bytes in 71.99 u 0.03 sys sec
zlib uncompressed 37961 bytes to 92343 bytes in 8.14 u 0.00 sys sec

It looks like the ultimate speed difference is roughly 2:1 on
decompression and less than that on compression, though zlib suffers
from greater setup overhead that makes it look worse on short strings.

Here is a much different example, a small JPEG image:

LZ compressed 5756 bytes to 5764 bytes in 1.48 u 0.00 sys sec
LZ decompressed 5764 bytes to 5756 bytes in 0.01 u 0.00 sys sec
zlib(3) compressed 5756 bytes to 5629 bytes in 3.18 u 0.00 sys sec
zlib uncompressed 5629 bytes to 5756 bytes in 0.75 u 0.00 sys sec
zlib(6) compressed 5756 bytes to 5625 bytes in 3.78 u 0.00 sys sec
zlib uncompressed 5625 bytes to 5756 bytes in 0.75 u 0.00 sys sec
zlib(9) compressed 5756 bytes to 5625 bytes in 3.78 u 0.00 sys sec
zlib uncompressed 5625 bytes to 5756 bytes in 0.75 u 0.00 sys sec

The PG-LZ code realizes it cannot compress this file and falls back to
straight bytewise copy, yielding very fast "decompress".  zlib manages
to eke out a few bytes' savings, so it goes ahead and compresses
instead of just storing literally, resulting in a big loss on
decompression speed for just a fractional space savings.  If we use
zlib, we'd probably be well advised to accept a compressed result only
if it's at least, say, 25% smaller than uncompressed to avoid this
low-return-on-investment situation.
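
That acceptance test is just integer arithmetic; a minimal sketch (the function name and threshold parameter are hypothetical):

```c
#include <stdbool.h>
#include <stddef.h>

/* Accept the compressed form only if it is at least min_saving_pct
 * percent smaller than the raw datum; otherwise store the datum
 * uncompressed, so we never pay decompression cost for a trivial
 * space win (as with the JPEG case above). */
static bool
worth_storing_compressed(size_t rawsize, size_t compsize, int min_saving_pct)
{
    /* compsize <= rawsize * (100 - min_saving_pct) / 100,
     * rearranged to avoid integer-division rounding issues */
    return compsize * 100 <= rawsize * (size_t) (100 - min_saving_pct);
}
```

With a 25% threshold, the JPEG above (5756 -> 5625 bytes) would be stored raw, while the small text file (990 -> 527 bytes) would be compressed.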

>     There might be some room for
>     improvement  using  static   zlib   stream   allocations   and
>     deflateReset(),  inflateReset()  or  the  like.  But  I don't
>     expect a significant difference from that.

It turns out you can get a noticeable win from overriding zlib's
default memory allocator, which calls calloc() even though it has
no need for pre-zeroed storage.  Just making it use malloc() instead
saves about 25% of the runtime on 1K-sized strings (the above numbers
include this improvement).  We'd probably want to make it use palloc
instead of malloc anyway, so this won't cost us any extra work.  I
have not pushed on it to see what else might be gained by tweaking.

>     What I suggest:

>     Leave  PGLZ  in place as the default compressor for toastable
>     types.  Speed is what all benchmarks talk  about  -  on-disk
>     storage size is seldom more than a minor note.

True, but the other side of the coin is that better compression and
smaller disk space will mean less I/O, better use of shared-memory
disk buffers, etc.  So to some extent, better compression helps pay
for itself.

>     Fix  its  history allocation for huge values and have someone
>     (PgSQL Inc.?)  patent the compression algorithm,  so  we're
>     safe at some point in the future.

That would be a really *bad* idea.  What will people say if we say
"Postgres contains patented algorithms, but we'll let you use them
for free" ?  They'll say "no thanks, I remember Unisys' repeatedly
broken promises about the GIF patent" and stay away in droves.
There is a *lot* of bad blood in the air about compression patents
of any sort.  We mustn't risk tainting Postgres' reputation with
that mess.

(In any case, one would hope you couldn't get a patent on this
method, though I suppose it never pays to overestimate the competence
of the USPTO...)

>     If there's a patent problem
>     in it, we are already running the risk of getting  sued:  the
>     PGLZ code shipped with 7.0, used in lztext.

But it hasn't been documented or advertised.  If we take it out
again in 7.1, I think our exposure to potential lawsuits from it is
negligible.  Not that I think there is any big risk there anyway,
but we ought to consider the possibility.

My feeling is that going with zlib is probably the right choice.
The technical case for using a homebrew compressor instead isn't
very compelling, and the advantages of using a standardized,
known-patent-free library are not to be ignored.
        regards, tom lane

