Re: lztext and compression ratios... - Mailing list pgsql-hackers
From: Tom Lane
Subject: Re: lztext and compression ratios...
Date:
Msg-id: 28549.963472077@sss.pgh.pa.us
In response to: Re: Re: [SQL] Re: [GENERAL] lztext and compression ratios... (JanWieck@t-online.de (Jan Wieck))
List: pgsql-hackers
JanWieck@t-online.de (Jan Wieck) writes:
> Some quick numbers though:
> I simply stripped down pg_lzcompress.c to call compress2()
> and uncompress() instead of doing anything itself (what a
> nice, small source file :-).

I went at it in a different way: pulled out pg_lzcompress into a
standalone source program that could also call zlib.  These numbers
represent the pure compression or decompression time for
memory-to-memory processing, no other overhead at all.  Each run was
iterated 1000 times to make it long enough to time accurately (so you
can read the times as "milliseconds per operation", though they're
really seconds).

A small text file of about 1K:

LZ      compressed 990 bytes to 717 bytes in 0.35 u 0.00 sys sec
LZ      decompressed 717 bytes to 990 bytes in 0.04 u 0.00 sys sec
zlib(3) compressed 990 bytes to 527 bytes in 0.73 u 0.00 sys sec
zlib    uncompressed 527 bytes to 990 bytes in 0.21 u 0.00 sys sec
zlib(6) compressed 990 bytes to 513 bytes in 0.86 u 0.00 sys sec
zlib    uncompressed 513 bytes to 990 bytes in 0.21 u 0.00 sys sec
zlib(9) compressed 990 bytes to 513 bytes in 0.86 u 0.00 sys sec
zlib    uncompressed 513 bytes to 990 bytes in 0.20 u 0.00 sys sec

A larger text file, a bit under 100K:

LZ      compressed 92343 bytes to 54397 bytes in 49.00 u 0.05 sys sec
LZ      decompressed 54397 bytes to 92343 bytes in 3.74 u 0.00 sys sec
zlib(3) compressed 92343 bytes to 40524 bytes in 38.48 u 0.02 sys sec
zlib    uncompressed 40524 bytes to 92343 bytes in 8.58 u 0.00 sys sec
zlib(6) compressed 92343 bytes to 38002 bytes in 68.04 u 0.03 sys sec
zlib    uncompressed 38002 bytes to 92343 bytes in 8.13 u 0.00 sys sec
zlib(9) compressed 92343 bytes to 37961 bytes in 71.99 u 0.03 sys sec
zlib    uncompressed 37961 bytes to 92343 bytes in 8.14 u 0.00 sys sec

It looks like the ultimate speed difference is roughly 2:1 on
decompression and less than that on compression, though zlib suffers
from greater setup overhead that makes it look worse on short strings.

Here is a much different example, a small JPEG image:

LZ      compressed 5756 bytes to 5764 bytes in 1.48 u 0.00 sys sec
LZ      decompressed 5764 bytes to 5756 bytes in 0.01 u 0.00 sys sec
zlib(3) compressed 5756 bytes to 5629 bytes in 3.18 u 0.00 sys sec
zlib    uncompressed 5629 bytes to 5756 bytes in 0.75 u 0.00 sys sec
zlib(6) compressed 5756 bytes to 5625 bytes in 3.78 u 0.00 sys sec
zlib    uncompressed 5625 bytes to 5756 bytes in 0.75 u 0.00 sys sec
zlib(9) compressed 5756 bytes to 5625 bytes in 3.78 u 0.00 sys sec
zlib    uncompressed 5625 bytes to 5756 bytes in 0.75 u 0.00 sys sec

The PG-LZ code realizes it cannot compress this file and falls back to
a straight bytewise copy, yielding a very fast "decompress".  zlib
manages to eke out a few bytes' savings, so it goes ahead and
compresses instead of just storing literally, resulting in a big loss
in decompression speed for only a fractional space savings.  If we use
zlib, we'd probably be well advised to accept a compressed result only
if it's at least, say, 25% smaller than uncompressed, to avoid this
low-return-on-investment situation.

> There might be some room for
> improvement using static zlib stream allocations and
> deflateReset(), inflateReset() or the like. But I don't
> expect a significant difference from that.

It turns out you can get a noticeable win from overriding zlib's
default memory allocator, which calls calloc() even though it has no
need for pre-zeroed storage.  Just making it use malloc() instead saves
about 25% of the runtime on 1K-sized strings (the above numbers include
this improvement).
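For concreteness, the general shape of such a memory-to-memory harness
is sketched below.  This is untested and is not the program I actually
ran; the timing uses plain clock(), and the output-buffer bound follows
zlib's documented worst case for compress() (0.1% plus 12 bytes, padded
a little further here):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <zlib.h>

#define ITERATIONS 1000

int
main(int argc, char **argv)
{
	FILE	   *fp;
	long		srclen;
	unsigned char *src, *comp, *back;
	uLongf		complen = 0, backlen;
	uLong		bound;
	int			level = (argc > 2) ? atoi(argv[2]) : 6;
	clock_t		t0, t1;
	int			i;

	if (argc < 2 || (fp = fopen(argv[1], "rb")) == NULL)
	{
		fprintf(stderr, "usage: %s file [level]\n", argv[0]);
		return 1;
	}
	/* slurp the whole input file into memory */
	fseek(fp, 0L, SEEK_END);
	srclen = ftell(fp);
	rewind(fp);
	src = malloc(srclen);
	fread(src, 1, srclen, fp);
	fclose(fp);

	/* worst-case expansion per zlib's docs, plus a little slop */
	bound = srclen + srclen / 1000 + 64;
	comp = malloc(bound);
	back = malloc(srclen);

	/* time ITERATIONS compressions, memory to memory */
	t0 = clock();
	for (i = 0; i < ITERATIONS; i++)
	{
		complen = bound;
		compress2(comp, &complen, src, (uLong) srclen, level);
	}
	t1 = clock();
	printf("zlib(%d) compressed %ld bytes to %lu bytes in %.2f sec\n",
		   level, srclen, (unsigned long) complen,
		   (double) (t1 - t0) / CLOCKS_PER_SEC);

	/* and ITERATIONS decompressions */
	t0 = clock();
	for (i = 0; i < ITERATIONS; i++)
	{
		backlen = (uLongf) srclen;
		uncompress(back, &backlen, comp, complen);
	}
	t1 = clock();
	printf("zlib uncompressed %lu bytes to %ld bytes in %.2f sec\n",
		   (unsigned long) complen, srclen,
		   (double) (t1 - t0) / CLOCKS_PER_SEC);

	free(src);
	free(comp);
	free(back);
	return 0;
}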
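The acceptance test I have in mind would look something like the
sketch below.  compress_datum() and its API are invented for the
example, not anything in the tree; the caller is assumed to size dst
for the worst case (srclen + srclen/1000 + 64, which is always at
least srclen, so the literal fallback fits too):

#include <string.h>
#include <zlib.h>

static int
compress_datum(const unsigned char *src, uLong srclen,
			   unsigned char *dst, uLongf *dstlen)
{
	*dstlen = srclen + srclen / 1000 + 64;
	if (compress2(dst, dstlen, src, srclen, Z_DEFAULT_COMPRESSION) == Z_OK &&
		*dstlen <= srclen - srclen / 4)
		return 1;				/* keep it: at least 25% smaller */

	/* not worth it: store literally, so "decompress" is a plain copy */
	memcpy(dst, src, srclen);
	*dstlen = srclen;
	return 0;
}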
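The allocator override itself is trivial; something like the following
(again just a sketch, with invented names).  Note that compress2()
hardwires the default allocator, so you have to go through
deflateInit()/deflate()/deflateEnd() yourself:

#include <stdlib.h>
#include <zlib.h>

/* malloc-based hooks: deflate never needs pre-zeroed memory */
static voidpf
my_zalloc(voidpf opaque, uInt items, uInt size)
{
	(void) opaque;
	return malloc((size_t) items * size);	/* palloc() in the backend */
}

static void
my_zfree(voidpf opaque, voidpf address)
{
	(void) opaque;
	free(address);				/* pfree() in the backend */
}

static int
compress_with_malloc(Bytef *dst, uLongf *dstlen,
					 const Bytef *src, uLong srclen, int level)
{
	z_stream	zs;
	int			err;

	/* install our hooks before deflateInit() */
	zs.zalloc = my_zalloc;
	zs.zfree = my_zfree;
	zs.opaque = NULL;

	err = deflateInit(&zs, level);
	if (err != Z_OK)
		return err;

	zs.next_in = (Bytef *) src;
	zs.avail_in = (uInt) srclen;
	zs.next_out = dst;
	zs.avail_out = (uInt) *dstlen;

	/* single-shot compression of the whole buffer */
	err = deflate(&zs, Z_FINISH);
	if (err == Z_STREAM_END)
	{
		*dstlen = zs.total_out;
		err = Z_OK;
	}
	else if (err == Z_OK)
		err = Z_BUF_ERROR;		/* ran out of output space */

	deflateEnd(&zs);
	return err;
}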
We'd probably want to make it use palloc instead of malloc anyway, so
this won't cost us any extra work.  I have not pushed on it to see
what else might be gained by tweaking.

> What I suggest:
> Leave PGLZ in place as the default compressor for toastable
> types. Speed is what all benchmarks talk about - on-disk
> storage size is seldom more than a minor note.

True, but the other side of the coin is that better compression and
smaller disk space will mean less I/O, better use of shared-memory
disk buffers, etc.  So to some extent, better compression helps pay
for itself.

> Fix its history allocation for huge values and have someone
> (PgSQL Inc.?) patent the compression algorithm, so we're
> safe at some point in the future.

That would be a really *bad* idea.  What will people say if we say
"Postgres contains patented algorithms, but we'll let you use them for
free"?  They'll say "no thanks, I remember Unisys' repeatedly broken
promises about the GIF patent" and stay away in droves.  There is a
*lot* of bad blood in the air about compression patents of any sort.
We mustn't risk tainting Postgres' reputation with that mess.

(In any case, one would hope you couldn't get a patent on this method,
though I suppose it never pays to overestimate the competence of the
USPTO...)

> If there's a patent problem
> in it, we are already running the risk of getting sued; the PGLZ
> code got shipped with 7.0, used in lztext.

But it hasn't been documented or advertised.  If we take it out again
in 7.1, I think our exposure to potential lawsuits from it is
negligible.  Not that I think there is any big risk there anyway, but
we ought to consider the possibility.

My feeling is that going with zlib is probably the right choice.  The
technical case for using a homebrew compressor instead isn't very
compelling, and the advantages of using a standardized,
known-patent-free library are not to be ignored.

			regards, tom lane