Re: Re: [SQL] Re: [GENERAL] lztext and compression ratios... - Mailing list pgsql-hackers
From            | JanWieck@t-online.de (Jan Wieck)
----------------|------------------------------------------------------------
Subject         | Re: Re: [SQL] Re: [GENERAL] lztext and compression ratios...
Date            |
Msg-id          | 200007101511.RAA11221@hot.jw.home
In response to  | Re: Re: [SQL] Re: [GENERAL] lztext and compression ratios... (Tom Lane <tgl@sss.pgh.pa.us>)
Responses       | Re: Re: [SQL] Re: [GENERAL] lztext and compression ratios...
                | Re: lztext and compression ratios...
List            | pgsql-hackers
Tom Lane wrote:
> JanWieck@t-online.de (Jan Wieck) writes:
> > eisentrp@csis.gvsu.edu wrote:
> >> Maybe you just want to use zlib. Let other guys hammer out the details.
>
> > We cannot assume that zlib is available everywhere.
>
> We can if we include it in our distribution --- which we could; it's
> pretty small and uses a BSD-style license.  I can assure you the zlib
> guys would be happy with that.  And it's certainly as portable as our
> own code.  The real question is, is a custom compressor enough better
> than zlib for our purposes to make it worth taking any patent risks?

Good, we shouldn't have to worry about that anymore. If we want to use
zlib, I vote for including it in our distribution and linking statically
against the one shipped with our code. If we want to ...

> We could run zlib at a low compression setting (-z1 to -z3 maybe)
> to make compression relatively fast, and since that also doesn't
> generate a custom Huffman tree, the overhead in the compressed data
> is minor even for short strings.  And its memory footprint is
> certainly no worse than Jan's method...

Definitely not --- its memory footprint is actually much smaller. Thus,
I need to redo the comparison below after making the history table a
fixed size with a wrap-around mechanism, so that we get a small
footprint on multi-MB inputs too.

> The real question is whether zlib decompression is markedly slower
> than Jan's code.  Certainly Jan's method is a lot simpler and *should*
> be faster --- but on the other hand, zlib has had a heck of a lot
> of careful performance tuning put into it over the years.  The speed
> difference might not be as bad as all that.
>
> I think it's worth taking a look at the option.

Some quick numbers, though:

I simply stripped down pg_lzcompress.c to call compress2() and
uncompress() instead of doing anything itself (what a nice, small
source file :-). There might be some room for improvement using
statically allocated zlib streams and deflateReset()/inflateReset() or
the like, but I don't expect a significant difference from that.

The test is a Tcl (pgtclsh) script doing the following:

- Load 151 HTML files into a table t1 of structure (path text,
  content lztext).
- SELECT * FROM t1 and check for the correct result set. Each file is
  read again during the check.
- UPDATE t1 SET content = upper(content). SELECT * FROM t1 and check
  for the correct result set. Each file is read again and converted to
  upper case using Tcl's "string toupper" command for comparison.
- SELECT path FROM t1, then loop over the result set to UPDATE t1 SET
  content = <value> WHERE path = <path>. All files are read again and
  converted to lower case before the UPDATE.
- SELECT * FROM t1 and check for the correct result set. The files are
  again reread and converted to lower case in Tcl for comparison.
- Do 20 SELECT * FROM t1, to get a lot more decompress than compress
  cycles.

Of course, there's an index on path. Here are the timings and sizes:

Compressor | level | heap size | toastrel | toastidx | seconds
           |       |           | size     | size     |
-----------+-------+-----------+----------+----------+--------
PGLZ       |   -   |   425,984 |  950,272 |   32,768 |   5.20
zlib       |   1   |   499,712 |  614,400 |   16,384 |   6.85
zlib       |   3   |   499,712 |  557,056 |   16,384 |   6.75
zlib       |   6   |   491,520 |  524,288 |   16,384 |   7.10
zlib       |   9   |   491,520 |  524,288 |   16,384 |   7.21

Seconds is an average over multiple runs. Interestingly, compression
level 3 seems to be faster than level 1; I double-checked it because it
was so surprising. Also, increasing the number of SELECT * runs at the
end increases the difference.
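For reference, here is a minimal, self-contained sketch of the kind of
compress2()/uncompress() wrapper described above (compile with -lz).
The function names and the malloc()-based buffer handling are
illustrative assumptions only, not the actual pg_lzcompress.c patch:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

/*
 * Compress src into a freshly malloc()'d buffer at the given zlib
 * level (1..9).  Returns NULL on failure, otherwise sets *dstlen.
 */
static unsigned char *
zlib_compress_blob(const unsigned char *src, uLong srclen,
                   uLongf *dstlen, int level)
{
    /*
     * Worst case per the zlib documentation: 0.1% growth plus 12
     * bytes (one extra byte here to be safe with integer rounding).
     */
    uLongf          bound = srclen + srclen / 1000 + 13;
    unsigned char  *dst = malloc(bound);

    if (dst == NULL)
        return NULL;
    *dstlen = bound;
    if (compress2(dst, dstlen, src, srclen, level) != Z_OK)
    {
        free(dst);
        return NULL;
    }
    return dst;
}

/*
 * Decompress into a malloc()'d buffer.  The caller must have kept
 * the original raw size (the real code stores it in the value's
 * header).
 */
static unsigned char *
zlib_decompress_blob(const unsigned char *src, uLong srclen, uLongf rawlen)
{
    unsigned char  *dst = malloc(rawlen);
    uLongf          dstlen = rawlen;

    if (dst == NULL)
        return NULL;
    if (uncompress(dst, &dstlen, src, srclen) != Z_OK || dstlen != rawlen)
    {
        free(dst);
        return NULL;
    }
    return dst;
}

int
main(void)
{
    const char     *text = "some fairly repetitive html html html text";
    uLong           rawlen = (uLong) strlen(text) + 1;
    uLongf          clen;
    unsigned char  *cbuf = zlib_compress_blob((const unsigned char *) text,
                                              rawlen, &clen, 3);

    if (cbuf != NULL)
    {
        unsigned char *rbuf = zlib_decompress_blob(cbuf, clen, rawlen);

        printf("raw %lu -> compressed %lu bytes\n",
               (unsigned long) rawlen, (unsigned long) clen);
        free(rbuf);
        free(cbuf);
    }
    return 0;
}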
So the PGLZ decompressor does a perfect job. What must also be taken
into account is that the script, running on the same processor and
doing all the overhead (reading files, doing case conversions, quoting
values with regsub, and comparing), along with the normal Postgres
query execution (parsing, planning, optimizing, executing), occupies a
substantial portion of the bare runtime. Still, PGLZ is about 25%
faster than the best zlib compression level I'm seeing, while zlib
gains a much better compression ratio (a factor of 1.7 at least ---
950,272 vs. 557,056 bytes of toast relation).

As I see it: if replacing the compressor/decompressor can cause a 25%
runtime difference in such a scenario, the pure difference between the
two methods must be a lot. PGLZ is what I mentioned in the comments:
optimized for speed at the cost of compression ratio.

What I suggest: leave PGLZ in place as the default compressor for
toastable types. Speed is what all the benchmarks talk about; on-disk
storage size is seldom more than a minor note. Fix its history
allocation for huge values, and have someone (PgSQL Inc.?) patent the
compression algorithm, so we're safe at some point in the future. If
there's a patent problem in it, we are already running the risk of
getting sued --- the PGLZ code shipped with 7.0, used in lztext.

We can discuss enabling zlib as a per-attribute configurable
alternative further. But is the confusion this might cause worth it
all?


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #