Re: Re: [SQL] Re: [GENERAL] lztext and compression ratios... - Mailing list pgsql-hackers

From JanWieck@t-online.de (Jan Wieck)
Subject Re: Re: [SQL] Re: [GENERAL] lztext and compression ratios...
Date
Msg-id 200007101511.RAA11221@hot.jw.home
In response to Re: Re: [SQL] Re: [GENERAL] lztext and compression ratios...  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Re: [SQL] Re: [GENERAL] lztext and compression ratios...
Re: lztext and compression ratios...
List pgsql-hackers
Tom Lane wrote:
> JanWieck@t-online.de (Jan Wieck) writes:
> > eisentrp@csis.gvsu.edu wrote:
> >> Maybe you just want to use zlib. Let other guys hammer out the details.
>
> >     We  cannot  assume that zlib is available everywhere.
>
> We can if we include it in our distribution --- which we could; it's
> pretty small and uses a BSD-style license.  I can assure you the zlib
> guys would be happy with that.  And it's certainly as portable as our
> own code.  The real question is, is a custom compressor enough better
> than zlib for our purposes to make it worth taking any patent risks?
   Good, we shouldn't worry about that anymore. If we want to use
   zlib, I vote for including it in our distribution and linking
   statically against the copy shipped with our code.
 
   If we want to ...

> We could run zlib at a low compression setting (-z1 to -z3 maybe)
> to make compression relatively fast, and since that also doesn't
> generate a custom Huffman tree, the overhead in the compressed data
> is minor even for short strings.  And its memory footprint is
> certainly no worse than Jan's method...
   Definitely not, its memory footprint is actually much smaller.
   Thus, I need to redo the comparison below after giving the
   history table a fixed size with a wrap-around mechanism, to get
   a small footprint on multi-MB inputs too.
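   To illustrate what a fixed-size history with wrap-around could
   look like, here is a minimal sketch. It is not the PGLZ code;
   the class name RingHistory and its methods are hypothetical, and
   it only shows the general idea that the table is allocated once
   and the oldest entry gets overwritten when it is full.

```python
class RingHistory:
    """Fixed-size history of input positions (hypothetical sketch).

    The table is allocated once; when it is full, the oldest entry
    is overwritten, so memory use does not depend on input size.
    """

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.slots = [None] * capacity   # allocated once, never grows
        self.next_slot = 0               # slot the next entry is written to
        self.count = 0                   # number of valid entries so far

    def add(self, position):
        self.slots[self.next_slot] = position
        self.next_slot = (self.next_slot + 1) % self.capacity  # wrap around
        if self.count < self.capacity:
            self.count += 1

    def entries(self):
        """Return the stored positions, oldest first."""
        if self.count < self.capacity:
            return self.slots[:self.count]
        return self.slots[self.next_slot:] + self.slots[:self.next_slot]


history = RingHistory(4)
for pos in range(10):
    history.add(pos)
print(history.entries())   # only the 4 most recent positions remain
```

   The cost of this choice is that matches further back than the
   table can remember are no longer found, so the ratio may suffer
   somewhat on large inputs.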
 

> The real question is whether zlib decompression is markedly slower
> than Jan's code.  Certainly Jan's method is a lot simpler and *should*
> be faster --- but on the other hand, zlib has had a heck of a lot
> of careful performance tuning put into it over the years.  The speed
> difference might not be as bad as all that.
>
> I think it's worth taking a look at the option.
   Some quick numbers though:
   I simply stripped down pg_lzcompress.c to call compress2() and
   uncompress() instead of doing anything itself (what a nice, small
   source file :-). There might be some room for improvement using
   static zlib stream allocations and deflateReset(), inflateReset()
   or the like. But I don't expect a significant difference from
   that.
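   For anyone who wants to play with the levels without touching
   the backend, the same kind of one-shot round trip can be sketched
   with Python's zlib module, whose zlib.compress() and
   zlib.decompress() behave like compress2() and uncompress(). This
   is not the stripped-down pg_lzcompress.c; the roundtrip() helper
   and the sample data are made up for illustration.

```python
import zlib


def roundtrip(data, level):
    """Compress and decompress data once; return the compressed size."""
    packed = zlib.compress(data, level)     # one-shot, like compress2()
    unpacked = zlib.decompress(packed)      # one-shot, like uncompress()
    if unpacked != data:
        raise ValueError("round trip did not reproduce the input")
    return len(packed)


# Hypothetical, highly repetitive HTML-like input; real files compress less well.
sample = (b"<html><body>"
          + b"<p>lztext and compression ratios</p>" * 200
          + b"</body></html>")

sizes = {level: roundtrip(sample, level) for level in (1, 3, 6, 9)}
for level, size in sizes.items():
    print(f"level {level}: {len(sample)} -> {size} bytes")
```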
   The test is a Tcl (pgtclsh) script doing the following:
   -   Loading 151 HTML files into a table t1 of structure (path
       text, content lztext).

   -   SELECT * FROM t1 and checking for correct result set.
       Each file is read again during the check.

   -   UPDATE t1 SET content = upper(content).

   -   SELECT * FROM t1 and checking for correct result set.
       Each file is read again, converted to upper case using
       Tcl's "string toupper" function for comparison.

   -   SELECT path FROM t1. Loop over result set to UPDATE t1
       SET content = <value> WHERE path = <path>. All files are
       read again and converted to lower case before UPDATE.

   -   SELECT * FROM t1 and check for correct result set. Files
       are again reread and lower case converted in Tcl for
       comparison.

   -   Doing 20 SELECT * FROM t1 to have a lot more decompress
       than compress cycles.

   Of course, there's an index on path. Here are the timings and
   sizes:

   Compressor | level | heap size | toastrel | toastidx | seconds
              |       |           |   size   |   size   |
   -----------+-------+-----------+----------+----------+--------
   PGLZ       |   -   |   425,984 |  950,272 |   32,768 |    5.20
   zlib       |   1   |   499,712 |  614,400 |   16,384 |    6.85
   zlib       |   3   |   499,712 |  557,056 |   16,384 |    6.75
   zlib       |   6   |   491,520 |  524,288 |   16,384 |    7.10
   zlib       |   9   |   491,520 |  524,288 |   16,384 |    7.21
 
   Seconds is an average over multiple runs. Interestingly,
   compression level 3 seems to be faster than 1. I double checked
   it because it was so surprising.
 
   Also, increasing the number of SELECT * at the end increases
   the difference. So the PGLZ decompressor does a perfect job.
 
   And what must be taken into account too is that the script,
   running on the same processor and doing all the overhead
   (reading files, doing case conversions, quoting values with
   regsub and comparisons), along with the normal Postgres query
   execution (parsing, planning, optimizing, execution) occupies a
   substantial portion of the bare runtime. Still PGLZ is about 25%
   faster than the best zlib compression level I'm seeing, while
   zlib gains a much better compression ratio (factor 1.7 at
   least).
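   The figures behind those two statements can be rechecked from
   the timing table. The sketch below only redoes the arithmetic;
   the compare() helper and the dictionary keys are made up for
   illustration. Against zlib level 3, PGLZ needs roughly 23% less
   time (put the other way, zlib takes about 30% longer), and the
   toast relation is larger by a factor of about 1.71.

```python
# (level, toastrel size, seconds) copied from the timing table
PGLZ = ("-", 950_272, 5.20)
ZLIB = [
    (1, 614_400, 6.85),
    (3, 557_056, 6.75),
    (6, 524_288, 7.10),
    (9, 524_288, 7.21),
]


def compare(pglz, zlib_row):
    """Relate one zlib row of the table to the PGLZ row."""
    _, pglz_size, pglz_secs = pglz
    level, size, secs = zlib_row
    return {
        "level": level,
        "pglz_time_saved": 1 - pglz_secs / secs,   # share of zlib's runtime PGLZ avoids
        "zlib_time_added": secs / pglz_secs - 1,   # extra runtime relative to PGLZ
        "toastrel_factor": pglz_size / size,       # how much larger PGLZ's toastrel is
    }


results = [compare(PGLZ, row) for row in ZLIB]
for r in results:
    print(f"zlib level {r['level']}: "
          f"PGLZ saves {r['pglz_time_saved']:.1%} runtime, "
          f"zlib adds {r['zlib_time_added']:.1%}, "
          f"toastrel factor {r['toastrel_factor']:.2f}")
```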
 
   As I see it:
   If replacing the compressor/decompressor can cause a runtime
   difference of 25% in such a scenario, the pure difference
   between the two methods must be a lot.
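   A rough way to see this: if some part of the runtime is overhead
   that is the same in both runs, the ratio of what remains is
   larger than the ratio of the totals. The overhead shares used
   below are pure assumptions, nothing was measured, and the
   pure_ratio() helper is made up for illustration. It also assumes
   the overhead really is identical for both compressors.

```python
PGLZ_SECONDS = 5.20   # from the timing table
ZLIB_SECONDS = 6.75   # best zlib run (level 3) from the timing table


def pure_ratio(fast_total, slow_total, shared_overhead):
    """Ratio slow/fast after removing overhead common to both runs."""
    if not 0 <= shared_overhead < fast_total:
        raise ValueError("overhead must be non-negative and below the faster total")
    return (slow_total - shared_overhead) / (fast_total - shared_overhead)


# Hypothetical shares of the PGLZ run spent outside the compressor.
for share in (0.0, 0.5, 0.8):
    overhead = share * PGLZ_SECONDS
    ratio = pure_ratio(PGLZ_SECONDS, ZLIB_SECONDS, overhead)
    print(f"assumed overhead {share:.0%} of the PGLZ run: "
          f"zlib/PGLZ ratio of the remainder = {ratio:.2f}")
```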
 
   PGLZ is what I mentioned in the comments. Optimized for speed
   at the cost of compression ratio.
   What I suggest:
   Leave PGLZ in place as the default compressor for toastable
   types. Speed is what all benchmarks talk about; on-disk storage
   size is seldom more than a minor note.
 
   Fix its history allocation for huge values and have someone
   (PgSQL Inc.?) patent the compression algorithm, so we're safe at
   some point in the future. If there's a patent problem in it, we
   are already running the risk of getting sued: the PGLZ code got
   shipped with 7.0, used in lztext.
 
   We can discuss enabling zlib as a per-attribute configurable
   alternative further. But is the confusion this might cause worth
   it all?
 


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #



