Thread: Re: [GENERAL] lztext and compression ratios...

Re: [GENERAL] lztext and compression ratios...

From
JanWieck@t-online.de (Jan Wieck)
Date:
Tom Lane wrote:
> JanWieck@t-online.de (Jan Wieck) writes:
> >     One  thing  to  keep  in mind is that the LZ algorithm you're
> >     thinking of must be distributable under the terms of the  BSD
> >     license.  If it's copyrighted or patented by any third party,
> >     not agreeing to these terms, it's out of discussion and  will
> >     never  appear in the Postgres source tree.  Especially the LZ
> >     algorithm used in GIF is one of these show-stoppers.
>
> As long as you brought it up: how sure are you that the method you've
> used is not subject to any patents?  The mere fact that you learned
> it from someone who didn't patent it does not guarantee anything ---
> someone else could have invented it independently and filed for a
> patent.

    Now that you ask for it: I'm not sure. Could be.

> If you can show that this method uses no ideas not found in zlib,
> then I'll feel reassured --- a good deal of legal research went into
> zlib to make sure it didn't fall foul of any patents, and zlib has
> now been around long enough that it'd be tough for anyone to get a
> new patent on one of its techniques.  But if SLZ has any additional
> ideas in it, then we could be in trouble.  There are an awful lot of
> compression patents.

    To  do  so  I  don't know enough about the algorithms used in
    zlib.  Is there someone out here who could verify it if I
    describe in enough detail what our compression code does?


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #



Re: [GENERAL] lztext and compression ratios...

From
Tom Lane
Date:
JanWieck@t-online.de (Jan Wieck) writes:
>> As long as you brought it up: how sure are you that the method you've
>> used is not subject to any patents?

>     Now that you ask for it: I'm not sure. Could be.

>> If you can show that this method uses no ideas not found in zlib,
>> then I'll feel reassured

>     To  do  so  I  don't know enough about the algorithms used in
>     zlib.  Is there someone out here who could verify it if I
>     describe in enough detail what our compression code does?

After a quick look at the code, I don't think there is anything
problematic about the data representation or the decompression
algorithm.  The compression algorithm is another story, and it's
not real well commented :-(.  The important issues are how you
search for matches in the past text and how you decide which match
is the best one to use.  Please update the code comments to describe
that, and I'll take another look.

            regards, tom lane

Re: [SQL] Re: [GENERAL] lztext and compression ratios...

From
JanWieck@t-online.de (Jan Wieck)
Date:
Tom Lane wrote:
> JanWieck@t-online.de (Jan Wieck) writes:
> >> As long as you brought it up: how sure are you that the method you've
> >> used is not subject to any patents?
>
> >     Now that you ask for it: I'm not sure. Could be.
>
> >> If you can show that this method uses no ideas not found in zlib,
> >> then I'll feel reassured
>
> >     To  do  so  I  don't know enough about the algorithms used in
> >     zlib.  Is there someone out here who could verify it if I
> >     describe in enough detail what our compression code does?
>
> After a quick look at the code, I don't think there is anything
> problematic about the data representation or the decompression
> algorithm.  The compression algorithm is another story, and it's
> not real well commented :-(.  The important issues are how you
> search for matches in the past text and how you decide which match
> is the best one to use.  Please update the code comments to describe
> that, and I'll take another look.

    Done. You'll find a new section in the top comments.

    While  writing  it  I  noticed  that  the algorithm is really
    expensive for big items. The history lookup  table  allocated
    is  8  times (on 32 bit architectures) the size of the input.
    So if you want to have 1MB compressed, it'll allocate 8MB for
    the  history.  It  hit  me  when  I  was hunting a bug in the
    toaster earlier today. Doing an update to a toasted  item  of
    5MB, resulting in a new value of 10MB, the backend blew up to
    290MB of virtual memory - oh boy. I definitely need  to  make
    that smarter.

    When  I wrote it I never thought about items that big. It was
    before we had the idea of TOAST.

    This all might  open  another  discussion  I'll  start  in  a
    separate thread.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #



Re: lztext and compression ratios...

From
Tom Lane
Date:
JanWieck@t-online.de (Jan Wieck) writes:
> Tom Lane wrote:
>> After a quick look at the code, I don't think there is anything
>> problematic about the data representation or the decompression
>> algorithm.  The compression algorithm is another story, and it's
>> not real well commented :-(.  The important issues are how you
>> search for matches in the past text and how you decide which match
>> is the best one to use.  Please update the code comments to describe
>> that, and I'll take another look.

>     Done. You'll find a new section in the top comments.

I think we are probably OK.  Anyone who wants to check the issue might
want to start with http://www.faqs.org/faqs/compression-faq/,
particularly part 1 item 8.  The two patents I'm most concerned about
are Waterworth's (4701745) and Stac's (5016009) which both cover basic
variants of LZ77 (for full text of these or any other US patent consult
http://patent.womplex.ibm.com/).  But it looks to me like this
particular variant can be argued not to fall under either patent.
For example, Waterworth's patent specifies using 3-byte hash values
while you use 4-byte, and stupid as that is it's sufficient to get
you out from under.  (A more technically valid reason is that he
doesn't use hash collision chains.)  The Stac patent also specifies
a data structure considerably different from your hash chains.
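[Editorial sketch of the distinction being drawn here: a single-probe
hash lookup, as in the patented variants, versus hash collision chains
that keep every earlier position with the same hash value. This is
illustrative C, not the actual pg_lzcompress or zlib source; the names,
the 4-byte hash, and the table sizes are invented for the example.]

    #include <stdint.h>

    #define HASH_SIZE 8192          /* illustrative table size */
    #define MAX_INPUT (1 << 20)

    /* a 4-byte hash, as in the variant under discussion */
    #define HASH4(p) ((((p)[0] << 9) ^ ((p)[1] << 6) ^ \
                       ((p)[2] << 3) ^ (p)[3]) % HASH_SIZE)

    /* head[] and prev_link[] must be initialized to -1 and updated as
     * each input position is inserted; that bookkeeping is omitted. */
    static int32_t head[HASH_SIZE];      /* newest position per bucket */
    static int32_t prev_link[MAX_INPUT]; /* older position, same hash  */

    /* Walk the whole collision chain and keep the longest match.  A
     * single-probe scheme (LZRW1 style) would test head[] once and
     * stop; chaining is what lets a compressor find better matches. */
    static int
    find_best_match(const uint8_t *in, int32_t pos, int32_t len,
                    int32_t *match_pos)
    {
        int     best = 0;
        int32_t cand = head[HASH4(in + pos)];

        while (cand >= 0)
        {
            int n = 0;

            while (pos + n < len && in[cand + n] == in[pos + n])
                n++;
            if (n > best)
            {
                best = n;
                *match_pos = cand;
            }
            cand = prev_link[cand];  /* follow chain to older matches */
        }
        return best;
    }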

To give you an idea of what's involved here, I attach an old message
from Jean-loup Gailly, author of gzip and zlib, explaining why he
believes gzip doesn't fall foul of these two patents.  Not all of
his reasons apply to your method, but I think enough do.  (BTW,
Gailly's a good guy --- I worked with him on the PNG project.
I think I'll write to him and see if he's got time to review this
code for patent problems.)

>     While  writing  it  I  noticed  that  the algorithm is really
>     expensive for big items. The history lookup  table  allocated
>     is  8  times (on 32 bit architectures) the size of the input.
>     So if you want to have 1MB compressed, it'll allocate 8MB for
>     the  history.  It  hit  me  when  I  was hunting a bug in the
>     toaster earlier today. Doing an update to a toasted  item  of
>     5MB, resulting in a new value of 10MB, the backend blew up to
>     290MB of virtual memory - oh boy. I definitely need  to  make
>     that smarter.

Yes, you need some smarter method for choosing the size of the
hash table... a braindead but possibly sufficient answer is to
have some hard limit on the size... or you could just use a
hard-wired constant size to begin with.  I think it's usually
considered wise to use a prime for the table size and reduce
the raw input bits modulo the prime.
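[A minimal sketch of that suggestion: cap the table at a hard limit,
pick a prime size, and reduce the raw input bits modulo the prime. The
primes and thresholds below are illustrative, not actual Postgres
code.]

    #include <stddef.h>

    static const unsigned hash_primes[] = {8191, 65521, 524287};

    /* Size the history table from the input length, with a hard upper
     * limit so a multi-MB datum no longer costs 8x its size in table. */
    static unsigned
    choose_table_size(size_t input_len)
    {
        if (input_len < 64 * 1024)
            return hash_primes[0];
        if (input_len < 1024 * 1024)
            return hash_primes[1];
        return hash_primes[2];       /* never larger than this */
    }

    static unsigned
    hash_bytes(const unsigned char *p, unsigned table_size)
    {
        /* pack the raw input bytes, then reduce modulo the prime */
        unsigned bits = ((unsigned) p[0] << 24) | ((unsigned) p[1] << 16) |
                        ((unsigned) p[2] << 8) | (unsigned) p[3];

        return bits % table_size;
    }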
        regards, tom lane


Article: 14604 of gnu.misc.discuss
Path: chorus!octave.chorus.fr!jloup
From: jloup@chorus.fr (Jean-loup Gailly)
Newsgroups: gnu.misc.discuss
Subject: Re: LPF, NY Times, GNU and compression algorithms
Message-ID: <7989@chorus.chorus.fr>
Date: 10 Mar 94 10:10:03 GMT
References: <RXN106.94Mar6103732@wilbur.cac.psu.edu> <HSU.94Mar7204324@laphroaig.cs.hut.fi>
<MARC.94Mar8090631@marc.watson.ibm.com>
Sender: news@chorus.chorus.fr
Distribution: gnu
Lines: 201

Marc Auslander <marc@watson.ibm.com> writes:

>   Two Stac patents in the case are 4701745 and 5016009.
>
>   From 4601745: [typo, meant: 4701745]
>
>   the data processing means including circuit means operable to check
>   whether a sequence of successive bytes to be processed identical with
>   a sequence of bytes already processed, and including a hash generating
>   means responsive to the application of a predetermined number of bytes ...
>
>   From 5016009
>
>   in order to perform the hashing function, a data compression system
>   includes certain hash data structures including a history array
>   pointer ...
>
>   also:
>
>   ... is found ..., encoding said matching string ... a variable length
>   indicator ..., said predetermined strategy ensuring that a matching
>   string of two characters of said input data stream is compressed to
>   less than said two characters of said input data stream.
>
>   So - on the face of it, why isn't gzip covered by these patents?

Let's take each patent in turn. A clarification first: the Stac patent
4,701,745 was not invented by Stac. It was initially owned by Ferranti
(UK) and only recently bought by Stac. This was a very profitable
acquisition (assuming it cost Stac far less than the $120 million they
won by using this patent against Microsoft).

(a) 4,701,745

This algorithm is now known as LZRW1, because Ross Williams reinvented
it independently later and posted it on comp.compression on April 22,
1991. Exactly the same algorithm has also been patented by Gibson and
Graybill (5,049,881). The patent office failed to recognize that
it was the same algorithm.

This algorithm uses LZ77 with hashing but no collision chains and
outputs unmatched bytes directly without further compression.  gzip
uses collision chains of arbitrary length, and uses Huffman encoding
on the unmatched bytes:

- Claim 1 of the patent is restricted to (emphasis added by me):

     output means operable to APPLY to a transfer medium each byte of data
     not forming part of such an identical sequence; and

     ENCODING means responsive to the identification of such a sequence
     to APPLY to the transfer medium an identification signal which
     identifies both the location in the input store of the previous
     occurrence of the sequence of bytes and the number of bytes
     contained in the sequence.

  The claim thus makes a clear distinction between "encoding" and
  "applying to the transfer medium". A system which compresses the
  unmatched bytes does not infringe this patent.
 

- The description of the patent and claim 2 make clear that the check
  for identity of the sequences of bytes is to be made only once (no
  hash collision chains). Gzip performs an arbitrary number of such
  checks. The "means" enumerated in claim 1 specify what the hash
  table consists of, and this does not include any means for storing
  hash collision chains.
 

- Claim 2 also requires that *all* bytes participating in the hash
  function should be compared:

     A system as claimed in claim 1 in which the circuit means also
     includes check means operable to check for identity between EACH
     of the said predetermined number of bytes in sequence and EACH of
     a similar sequence of bytes contained in the input store at a
     location defined by a pointer read out from the temporary store at
     said address

  [in plain English, this is the check for hash collision]

     and to check whether identity exists between succeeding bytes in
     each sequence of bytes, and a byte counter operable to count the
     number of identical bytes in each sequence.

  [this is the determination of the match length]

  Gzip never checks for equality of the third byte used in the hash
  function. The hash function is such that on a hash hit with equality
  of the first two bytes, the third byte necessarily matches (a
  brute-force check of this appears below).
 

- In addition, gzip uses a "lazy" evaluation of string matches. Even
  when a match is found, gzip may encode (with Huffman coding) a
  single unmatched byte. This is done when gzip determines that it is
  more beneficial to parse the input string differently because a
  longer match follows. In the Waterworth patent, a string match is
  always encoded as a (length, pointer) pair.
 

All other claims of the patent are dependent on claim 1
("a system as claimed in claim 1 in which ..."). Since gzip
does not infringe claim 1 it does not infringe the other claims.
In particular, claim 6 explicitly states that unmatched strings
are not compressed:

     A system as claimed in claim 5 in which the data receiving means
     includes decoder means operable to separate UNCOMPRESSED bytes of
     data from identification signals.
 

The gzip decoder never receives uncompressed bytes since all input is
compressed with Huffman coding [both literals and (length, offset) pairs].
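[Gailly's point above, that on a hash hit with equal first two bytes
the third byte necessarily matches, can be verified by brute force
with the hash constants of a typical gzip/zlib build (15 hash bits,
3-byte minimum match, shift of 5). An editorial check, not part of
either source tree; the inner loops are a few billion cheap
iterations:]

    #include <assert.h>

    #define HASH_MASK 0x7fff   /* (1 << 15) - 1 */
    /* three applications of h = ((h << 5) ^ c) & HASH_MASK */
    #define HASH3(a, b, c) ((((a) << 10) ^ ((b) << 5) ^ (c)) & HASH_MASK)

    int
    main(void)
    {
        /* equal hash plus equal first two bytes forces the third
         * bytes to be equal, since the a and b XOR terms cancel */
        for (int a = 0; a < 256; a++)
            for (int b = 0; b < 256; b++)
                for (int c1 = 0; c1 < 256; c1++)
                    for (int c2 = 0; c2 < 256; c2++)
                        if (HASH3(a, b, c1) == HASH3(a, b, c2))
                            assert(c1 == c2);
        return 0;
    }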


The only "invention" in the Waterworth patent is the absence of hash
collision chains. The original description of the LZ77 algorithm
required searching for the true longest match, not just checking the
length of one particular match.  Using hashing for string searching
was very well known at the time of the patent application (March 86).
The patent description specifies explicitly that "Hash techniques are
well known and many different forms of hash will be suitable".

The --fast option of gzip was deliberately made slower than it could
be, precisely because of the existence of the Waterworth patent.
See in particular the following part of the gzip TODO file:

     Add a super-fast compression method, suitable for implementing
     file systems with transparent compression. One problem is that
     the best candidate (lzrw1) is patented twice (Waterworth
     4,701,745 and Gibson & Graybill 5,049,881).
 


(b) 5,016,009

This is standard LZ77 with hashing, and collisions resolved using
linked lists. There are several important restrictions which let gzip
escape from the patent:

- the linked lists are implemented only with offsets. The end of a
  chain is detected by adding together all offsets, until the sum
  becomes greater than the size of the history buffer. gzip uses
  direct indices, and so detects the end of the chains differently.
  The exact wording of claim 1 of the patent is:

     ... said data compression system comprising ... an offset array
     means ... said method comprising the steps of ...
        calculating a difference between said history array pointer
           and said pointer obtained from said hash table means,
        storing said difference into said offset array means entry
           pointed to by said history array pointer, ...

  gzip never calculates such a difference and does not have any
  offset array.

- unmatched strings are emitted as literal bytes without any
  compression. gzip uses Huffman encoding on the unmatched strings.
  This is the same argument as for the Waterworth patent.
 

- unmatched strings are preceded by

     ... a "raw" data tag indicating that no matching data string was found

  gzip does not use such a tag because it uses a single Huffman table
  for both string literals and match lengths. It is only the prefix
  property of Huffman codes which allows the decoder to distinguish
  the two cases. So there is not a unique "raw" tag preceding all
  literals. This is not a minor point. It is one of the reasons giving
  gzip superior compression to that obtained with the Stac algorithm.
 

- a string match is always encoded as a (length, pointer) pair. Gzip
  uses a "lazy" evaluation of string matches as described above for
  the Waterworth patent.
 

All other claims of the patent are dependent on claim 1 ("the method
of claim 1 wherein ..."). Since gzip does not infringe claim 1 it does
not infringe the other claims.  In any case, I have studied in detail
all the 77 claims to make sure that gzip does not infringe.

Unrelated note: this Stac patent is the only one where I found an
original and non-obvious idea. The hash table is refreshed using an
incremental mechanism, so that the refresh overhead is distributed
among all input bytes. This allows the real-time response necessary in
disk compressors such as Stacker (the Stac product). gzip does not use
this idea, and refreshes the hash table in a straightforward manner
every 32K bytes.
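[For contrast, the "straightforward manner" described above: when the
sliding window advances by 32K, every hash table entry is rebased in
one bulk pass. An editorial sketch in the style of deflate's window
slide, not its exact source:]

    #define WSIZE 32768
    #define NIL   0

    typedef unsigned short Pos;

    /* Slide the hash head table by WSIZE in one pass; positions that
     * fall out of the window become NIL.  The Stac patent instead
     * spreads this refresh cost across every input byte. */
    static void
    slide_hash(Pos *head, unsigned table_size)
    {
        for (unsigned n = 0; n < table_size; n++)
        {
            Pos m = head[n];
            head[n] = (Pos) (m >= WSIZE ? m - WSIZE : NIL);
        }
    }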


One final comment: I estimate that I have now spent more time studying
data compression patents than actually implementing data compression
algorithms.  I have a partial list of 318 data compression patents,
even though for the moment I restrict myself mostly to lossless
algorithms (ignoring most image compression patents).  Richard
Stallman has been *extremely* careful before accepting gzip as the GNU
compressor.  I continue to study new patents regularly. I would of
course very much prefer spending what's left of my spare time
improving the gzip compression algorithm instead of studying patents.
Some improvements that I would have liked to put in gzip have not been
incorporated because of patents.

In short, every possible precaution has been taken to make sure that
gzip isn't covered by patents.

Jean-loup Gailly, author of gzip.
jloup@chorus.fr


Re: Re: [SQL] Re: [GENERAL] lztext and compression ratios...

From
eisentrp@csis.gvsu.edu
Date:
Maybe you just want to use zlib. Let other guys hammer out the details.

On Thu, 6 Jul 2000, Jan Wieck wrote:

>     While  writing  it  I  noticed  that  the algorithm is really
>     expensive for big items. The history lookup  table  allocated
>     is  8  times (on 32 bit architectures) the size of the input.
>     So if you want to have 1MB compressed, it'll allocate 8MB for
>     the  history.  It  hit  me  when  I  was hunting a bug in the
>     toaster earlier today. Doing an update to a toasted  item  of
>     5MB, resulting in a new value of 10MB, the backend blew up to
>     290MB of virtual memory - oh boy. I definitely need  to  make
>     that smarter.
> 
>     When  I wrote it I never thought about items that big. It was
>     before we had the idea of TOAST.
> 
>     This all might  open  another  discussion  I'll  start  in  a
>     separate thread.
> 
> 
> Jan
> 
> --
> 
> #======================================================================#
> # It's easier to get forgiveness for being wrong than for being right. #
> # Let's break this rule - forgive me.                                  #
> #================================================== JanWieck@Yahoo.com #
> 
> 
> 

-- 
Peter Eisentraut                  Sernanders vaeg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden



Re: Re: [SQL] Re: [GENERAL] lztext and compression ratios...

From
Tom Lane
Date:
eisentrp@csis.gvsu.edu writes:
> Maybe you just want to use zlib. Let other guys hammer out the details.

I've been wondering about that myself.  Jan, have you actually
benchmarked decompression speed on this method vs. stock zlib?
Unless the differential is huge we might want to just skip the
patent worries.

zlib has a BSD-style license, so we could include it in our
distribution to avoid worries about whether it's installed.

I've written to Jean-loup for his advice.  He's pretty busy these
days (CTO at some startup, I believe) but we'll see if he has time
to comment.
        regards, tom lane


Re: Re: [SQL] Re: [GENERAL] lztext and compression ratios...

From
Bruce Momjian
Date:
> eisentrp@csis.gvsu.edu writes:
> > Maybe you just want to use zlib. Let other guys hammer out the details.
> 
> I've been wondering about that myself.  Jan, have you actually
> benchmarked decompression speed on this method vs. stock zlib?
> Unless the differential is huge we might want to just skip the
> patent worries.

If we found later that there was a patent problem, could we just issue a
new release with a new compression algorithm?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: Re: [SQL] Re: [GENERAL] lztext and compression ratios...

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> Unless the differential is huge we might want to just skip the
>> patent worries.

> If we found later that there was a patent problem, could we just issue a
> new release with a new compression algorithm?

Yeah, we could, and it could presumably even be a fully compatible
dot-release with no change to the on-disk representation.  That
representation and consequently the decompression algorithm are safe
enough, it's the details of the compressor's search for matching
patterns that are a patent minefield.

However, changing the code after-the-fact might not be enough to keep us
from being sued :-(.  I'd rather use something that's pretty well
established as being in the clear...
        regards, tom lane


Re: Re: [SQL] Re: [GENERAL] lztext and compression ratios...

From
JanWieck@t-online.de (Jan Wieck)
Date:
eisentrp@csis.gvsu.edu wrote:
> Maybe you just want to use zlib. Let other guys hammer out the details.
>
    We cannot assume that zlib is available everywhere. Thus it
    must be determined during configure, and where it isn't, TOAST
    can only move off values to make tuples fit into blocks.
    Since decompression of already-in-memory items is a lot faster
    than doing an index scan on the TOAST table, I expect this to
    make installations without zlib damned slow.

    And how should binary distributions like RPM's handle it? I
    assume that this problem is already on its way because of
    the integration of zlib into pg_dump. The only way I see is
    having different RPM's for each possible combination of
    available helper libs. Or is there another way to work
    around?
 


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #




Re: Re: [SQL] Re: [GENERAL] lztext and compression ratios...

From
Tom Lane
Date:
JanWieck@t-online.de (Jan Wieck) writes:
> eisentrp@csis.gvsu.edu wrote:
>> Maybe you just want to use zlib. Let other guys hammer out the details.

>     We  cannot  assume that zlib is available everywhere.

We can if we include it in our distribution --- which we could; it's
pretty small and uses a BSD-style license.  I can assure you the zlib
guys would be happy with that.  And it's certainly as portable as our
own code.  The real question is, is a custom compressor enough better
than zlib for our purposes to make it worth taking any patent risks?

We could run zlib at a low compression setting (-z1 to -z3 maybe)
to make compression relatively fast, and since that also doesn't
generate a custom Huffman tree, the overhead in the compressed data
is minor even for short strings.  And its memory footprint is
certainly no worse than Jan's method...
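[What "a low compression setting" looks like against the real zlib
API: the one-shot compress2() takes the level as its last argument. A
self-contained editorial sketch with the classic conservative
output-buffer bound, not actual Postgres code:]

    #include <stdio.h>
    #include <stdlib.h>
    #include <zlib.h>

    int
    main(void)
    {
        const Bytef src[] = "some toastable attribute value, "
                            "some toastable attribute value";
        uLong  srcLen = (uLong) sizeof(src);
        uLongf dstLen = srcLen + srcLen / 1000 + 12;  /* worst case */
        Bytef *dst = malloc(dstLen);

        if (compress2(dst, &dstLen, src, srcLen, 3) != Z_OK) /* level 3 */
        {
            fprintf(stderr, "compress2 failed\n");
            return 1;
        }
        printf("compressed %lu -> %lu bytes\n",
               (unsigned long) srcLen, (unsigned long) dstLen);
        free(dst);
        return 0;
    }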

The real question is whether zlib decompression is markedly slower
than Jan's code.  Certainly Jan's method is a lot simpler and *should*
be faster --- but on the other hand, zlib has had a heck of a lot
of careful performance tuning put into it over the years.  The speed
difference might not be as bad as all that.

I think it's worth taking a look at the option.
        regards, tom lane


Re: Re: [SQL] Re: [GENERAL] lztext and compression ratios...

From
Peter Eisentraut
Date:
Jan Wieck writes:

> > Maybe you just want to use zlib. Let other guys hammer out the details.

>     And how should binary distributions like RPM's handle  it?

They got all their fancy dependency mechanisms for that (which never work
well, but anyway).


-- 
Peter Eisentraut                  Sernanders väg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden



Re: Re: [SQL] Re: [GENERAL] lztext and compression ratios...

From
Philip Warner
Date:
At 21:47 7/07/00 +0200, Jan Wieck wrote:
>
>    And how should binary distributions like RPM's handle  it?  I
>    assume  that  this  problem is already on its way because of
>    the integration of zlib into pg_dump. The only way I  see  is
>    having  different  RPM's  for  each  possible  combination of
>    available helper libs.  Or  is  there  another  way  to  work
>    around?

This reminded me of a question I wanted to ask Unix people: other OSs I
use allow for dynamic linking, at runtime and in code, against shared
libraries, and I know Unix must allow this. The places where zlib is used
are pretty limited, so it might be worth considering doing the 'HAVE_ZLIB'
kinds of checks at runtime. Then one binary fits all...

Is this hard or easy - at least on machines with a libz.so?

Is it worth doing?

I guess the alternative on rpm is to create both: pg_dump.zlib and
pg_dump.nozlib, and install the right one?
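[On the runtime-check idea: on systems with dlopen(), the probe Philip
describes could look like the sketch below; if libz.so resolves, use
its compress2(), otherwise fall back to uncompressed output. Purely
illustrative, not actual pg_dump code; link with -ldl.]

    #include <dlfcn.h>
    #include <stdio.h>

    /* mirror zlib's typedefs locally, since zlib.h may be absent */
    typedef unsigned char Bytef;
    typedef unsigned long uLong, uLongf;
    typedef int (*compress2_fn)(Bytef *, uLongf *, const Bytef *,
                                uLong, int);

    int
    main(void)
    {
        void         *handle = dlopen("libz.so", RTLD_LAZY);
        compress2_fn  do_compress = NULL;

        if (handle != NULL)
            do_compress = (compress2_fn) dlsym(handle, "compress2");

        if (do_compress != NULL)
            printf("libz.so found: would write a compressed dump\n");
        else
            printf("no usable libz.so: would write it uncompressed\n");

        if (handle != NULL)
            dlclose(handle);
        return 0;
    }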


----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.C.N. 008 659 498)             |          /(@)   ______---_
Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|
                                 |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/


Re: Re: [SQL] Re: [GENERAL] lztext and compression ratios...

From
Bruce Momjian
Date:
> This reminded me of a question I wanted to ask Unix people: other OSs I
> use allow for dynamic linking, at runtime and in code, against shared
> libraries, and I know Unix must allow this. The places where zlib is used
> are pretty limited, so it might be worth considering doing the 'HAVE_ZLIB'
> kinds of checks at runtime. Then one binary fits all...
> 
> Is this hard or easy - at least on machines with a libz.so?
> 
> Is it worth doing?
> 
> I guess the alternative on rpm is to create both: pg_dump.zlib and
> pg_dump.nozlib, and install the right one?

We do dynamic loading for functions.  Not sure if we want to load zlib
dynamically if we can help it.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: Re: [SQL] Re: [GENERAL] lztext and compression ratios...

From
Tom Lane
Date:
>> This reminded me of a question I wanted to ask Unix people: other OSs I
>> use allow for dynamic linking, at runtime and in code, against shared
>> libraries, and I know Unix must allow this. The places where zlib is used
>> are pretty limited, so it might be worth considering doing the 'HAVE_ZLIB'
>> kinds of checks at runtime. Then one binary fits all...

> We do dynamic loading for functions.  Not sure if we want to load zlib
> dynamically if we can help it.

Not bloody likely!  Do you want to be in a position where you restart
your postmaster and suddenly chunks of your database are inaccessible?
That's what could happen to you if someone moves or deletes libz.so.

I don't mind being dynamically linked to standard system shared libs;
if libc.so is busted then whether Postgres launches is the least of
your worries.  But dynamic dependence on an optional package that's
probably living in /usr/local strikes me as exceedingly risky.

If we do go with using zlib instead of homegrown code, I would recommend
building and statically linking to our own copy even if there is a copy
available on the system.  This will prevent cross-version compatibility
problems as well as where'd-my-library-go syndrome.  We cannot afford
those sorts of risks for something that could prevent us from reading
our database.
        regards, tom lane


Re: Re: [SQL] Re: [GENERAL] lztext and compression ratios...

From
Bruce Momjian
Date:
> If we do go with using zlib instead of homegrown code, I would recommend
> building and statically linking to our own copy even if there is a copy
> available on the system.  This will prevent cross-version compatibility
> problems as well as where'd-my-library-go syndrome.  We cannot afford
> those sorts of risks for something that could prevent us from reading
> our database.

I don't know much about zlib, but it is hard to imagine it would have
the flexibility and optimization tuning of Jan's code.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: Re: [SQL] Re: [GENERAL] lztext and compression ratios...

From
Philip Warner
Date:
At 21:49 7/07/00 -0400, Tom Lane wrote:
>
>Not bloody likely!  Do you want to be in a position where you restart
>your postmaster and suddenly chunks of your database are inaccessible?
>That's what could happen to you if someone moves or deletes libz.so.

My question was limited to its use in pg_dump; rather than basing
pg_dump's compression behaviour on configure, base it on its runtime
environment. But my guess is you still would be inclined, rather strongly,
against it.


>If we do go with using zlib instead of homegrown code

This begs the obvious question: should pg_dump be using Jan's compression
code? In all cases/when zlib is not available?


----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.C.N. 008 659 498)             |          /(@)   ______---_
Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|
                                 |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/


Re: Re: [SQL] Re: [GENERAL] lztext and compression ratios...

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I don't know much about zlib, but it is hard to imagine it would have
> the flexibility and optimization tuning of Jan's code.

Uh, I think you have that backwards ...
        regards, tom lane

PS: I do know a few things about zlib.


Re: Re: [SQL] Re: [GENERAL] lztext and compression ratios...

From
Tom Lane
Date:
Philip Warner <pjw@rhyme.com.au> writes:
>> Not bloody likely!  Do you want to be in a position where you restart
>> your postmaster and suddenly chunks of your database are inaccessible?
>> That's what could happen to you if someone moves or deletes libz.so.

> My question was limited to its use in pg_dump; rather than basing
> pg_dump's compression behaviour on configure, base it on its runtime
> environment. But my guess is you still would be inclined, rather strongly,
> against it.

pg_dump is rather a different case.  I would want to see a runtime
option *not* to use compression, in case you know you are going to
need to restore on another system where zlib (or whatever) isn't
available.  But making an uncompressed dump today doesn't invalidate
the compressed dump you made yesterday, nor vice versa.

>> If we do go with using zlib instead of homegrown code

> This begs the obvious question: should pg_dump be using Jan's compression
> code? In all cases/when zlib is not available?

Good point.  If we include zlib in the distribution it would be pretty
silly for pg_dump not to use it.  If we don't, then Peter's remarks
about not liking an environment-determined feature set are relevant.
        regards, tom lane


Re: Re: [SQL] Re: [GENERAL] lztext and compression ratios...

From
Philip Warner
Date:
At 22:28 7/07/00 -0400, Tom Lane wrote:
>
>pg_dump is rather a different case.  I would want to see a runtime
>option *not* to use compression

That's there already; using -Z0 produces the same output as a machine
without zlib.

>But making an uncompressed dump today doesn't invalidate
>the compressed dump you made yesterday, nor vice versa.

Not sure what you mean here.


>Good point.  If we include zlib in the distribution it would be pretty
>silly for pg_dump not to use it.  If we don't, then Peter's remarks
>about not liking an environment-determined feature set are relevant.

zlib seems to have been ported to zillions of architectures (must be a good
design ;-}), so it's probably pretty safe to put in the source tree, and
I'm sure any porting problems we encounter would be useful to the zlib
maintainers.

If we don't want to use zlib, then before using Jan's compression code in
pg_dump, I suppose I'd need to know that the output is compatible across
32/64 bit architectures and is not sensitive to other issues like endian-ness.


----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.C.N. 008 659 498)             |          /(@)   ______---_
Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|
                                 |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/


Re: Re: [SQL] Re: [GENERAL] lztext and compression ratios...

From
"Oliver Elphick"
Date:
Peter Eisentraut wrote:
>>     And how should binary distributions like RPM's handle  it?
>
> They got all their fancy dependency mechanisms for that (which never work
> well, but anyway).
 

Debian's work very well, because the Debian repository is centralised
under uniform quality control.

As a package maintainer, I can allow a package to depend on zlib -- in
which case the dependency will be satisfied by any version -- or I can
specify a versioned dependency.  I might have to provide the correct
library myself, if the zlib maintainer didn't want to maintain an old
version.  At the moment Debian provides zlib1g; if the major version
changes, I would expect to see this package remain and a new zlib2
package be created.

I would prefer a solution that did not _require_ me to maintain zlib
as well, if the dependency can be satisfied from another package.

-- 
Oliver Elphick                                Oliver.Elphick@lfix.co.uk
Isle of Wight                              http://www.lfix.co.uk/oliver
PGP: 1024R/32B8FAA1: 97 EA 1D 47 72 3F 28 47  6B 7E 39 CC 56 E4 C1 47
GPG: 1024D/3E1D0C1C: CA12 09E0 E8D5 8870 5839  932A 614D 4C34 3E1D 0C1C
========================================
   "Ask, and it shall be given you; seek, and ye shall
    find; knock, and it shall be opened unto you."
                                        Matthew 7:7
 




Re: Re: [SQL] Re: [GENERAL] lztext and compression ratios....

From
JanWieck@t-online.de (Jan Wieck)
Date:
Philip Warner wrote:
> At 21:49 7/07/00 -0400, Tom Lane wrote:
> >
> >Not bloody likely!  Do you want to be in a position where you restart
> >your postmaster and suddenly chunks of your database are inaccessible?
> >That's what could happen to you if someone moves or deletes libz.so.
>
> My question was limited to its use in pg_dump; rather than basing
> pg_dump's compression behaviour on configure, base it on its runtime
> environment. But my guess is you still would be inclined, rather strongly,
> against it.
>
>
> >If we do go with using zlib instead of homegrown code
>
> This begs the obvious question: should pg_dump be using Jan's compression
> code? In all cases/when zlib is not available?
    It can't. My code is designed for in-memory attribute values.
    It doesn't support streaming - which I assume is required for
    pg_dump.
 


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #




Re: Re: [SQL] Re: [GENERAL] lztext and compression ratios...

From
JanWieck@t-online.de (Jan Wieck)
Date:
Tom Lane wrote:
> JanWieck@t-online.de (Jan Wieck) writes:
> > eisentrp@csis.gvsu.edu wrote:
> >> Maybe you just want to use zlib. Let other guys hammer out the details.
>
> >     We  cannot  assume that zlib is available everywhere.
>
> We can if we include it in our distribution --- which we could; it's
> pretty small and uses a BSD-style license.  I can assure you the zlib
> guys would be happy with that.  And it's certainly as portable as our
> own code.  The real question is, is a custom compressor enough better
> than zlib for our purposes to make it worth taking any patent risks?
    Good, we shouldn't worry about that anymore. If we want to
    use zlib, I vote for including it in our distribution and
    linking statically against the one shipped with our code.

    If we want to ...

> We could run zlib at a low compression setting (-z1 to -z3 maybe)
> to make compression relatively fast, and since that also doesn't
> generate a custom Huffman tree, the overhead in the compressed data
> is minor even for short strings.  And its memory footprint is
> certainly no worse than Jan's method...
    Definitely not, its memory footprint is actually much
    smaller. Thus, I need to recreate the comparison below
    again after making the history table of fixed size with a
    wrap-around mechanism to get a small footprint on multi-MB
    inputs too.
 

> The real question is whether zlib decompression is markedly slower
> than Jan's code.  Certainly Jan's method is a lot simpler and *should*
> be faster --- but on the other hand, zlib has had a heck of a lot
> of careful performance tuning put into it over the years.  The speed
> difference might not be as bad as all that.
>
> I think it's worth taking a look at the option.
    Some quick numbers though:

    I simply stripped down pg_lzcompress.c to call compress2()
    and uncompress() instead of doing anything itself (what a
    nice, small source file :-). There might be some room for
    improvement using static zlib stream allocations and
    deflateReset(), inflateReset() or the like. But I don't
    expect a significant difference from that.
    The test is a Tcl (pgtclsh) script doing the following:

    -   Loading 151 HTML files into a table t1 of structure (path
        text, content lztext).

    -   SELECT * FROM t1 and checking for correct result set.
        Each file is read again during the check.

    -   UPDATE t1 SET content = upper(content).

    -   SELECT * FROM t1 and checking for correct result set.
        Each file is read again, converted to upper case using
        Tcl's "string toupper" function for comparison.

    -   SELECT path FROM t1. Loop over result set to UPDATE t1
        SET content = <value> WHERE path = <path>. All files are
        read again and converted to lower case before UPDATE.

    -   SELECT * FROM t1 and check for correct result set. Files
        are again reread and lower-case converted in Tcl for
        comparison.

    -   Doing 20 SELECT * FROM t1 to have a lot more decompress
        than compress cycles.
    Of course, there's an index on path. Here are the timings
    and sizes:

    Compressor | level | heap size | toastrel | toastidx | seconds
               |       |           |   size   |   size   |
    -----------+-------+-----------+----------+----------+--------
    PGLZ       |   -   |   425,984 |  950,272 |   32,768 |    5.20
    zlib       |   1   |   499,712 |  614,400 |   16,384 |    6.85
    zlib       |   3   |   499,712 |  557,056 |   16,384 |    6.75
    zlib       |   6   |   491,520 |  524,288 |   16,384 |    7.10
    zlib       |   9   |   491,520 |  524,288 |   16,384 |    7.21
 
    Seconds is an average over multiple runs. Interesting is
    that compression level 3 seems to be faster than 1. I
    double-checked it because it was so surprising.
 
    Also, increasing the number of SELECT * at the end increases
    the difference. So the PGLZ decompressor does a perfect job.
 
    And what must be taken into account too is that the script,
    running on the same processor and doing all the overhead
    (reading files, doing case conversions, quoting values with
    regsub and comparisons), along with the normal Postgres
    query execution (parsing, planning, optimizing, execution),
    occupies a substantial portion of the bare runtime. Still,
    PGLZ is about 25% faster than the best zlib compression
    level I'm seeing, while zlib gains a much better compression
    ratio (a factor of 1.7 at least).
 
    As I see it:

    If replacing the compressor/decompressor can cause a runtime
    difference of 25% in such a scenario, the pure difference
    between the two methods must be a lot.
 
    PGLZ is what I mentioned in the comments. Optimized for
    speed at the cost of compression ratio.
   What I suggest:
    Leave PGLZ in place as the default compressor for toastable
    types. Speed is what all benchmarks talk about - on disk
    storage size is seldom a minor note.
 
    Fix its history allocation for huge values and have someone
    (PgSQL Inc.?) patent the compression algorithm, so we're
    safe at some point in the future. If there's a patent
    problem in it, we are already running the risk of getting
    sued; the PGLZ code got shipped with 7.0, used in lztext.
 
    We can discuss enabling zlib as a per-attribute
    configurable alternative further. But is the confusion this
    might cause worth it all?
 


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #




Re: Re: [SQL] Re: [GENERAL] lztext and compression ratios...

From
Bruce Momjian
Date:
>     Compressor | level | heap size | toastrel | toastidx | seconds
>                |       |           |   size   |   size   |
>     -----------+-------+-----------+----------+----------+--------
>     PGLZ       |   -   |   425,984 |  950,272 |   32,768 |    5.20
>     zlib       |   1   |   499,712 |  614,400 |   16,384 |    6.85
>     zlib       |   3   |   499,712 |  557,056 |   16,384 |    6.75
>     zlib       |   6   |   491,520 |  524,288 |   16,384 |    7.10
>     zlib       |   9   |   491,520 |  524,288 |   16,384 |    7.21

Consider that the 25% slowness gets us a 35% disk reduction, and that
translates to fewer buffer blocks and disk accesses.  Seems there is a
clear tradeoff there.
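[Checking those figures against Jan's table: PGLZ at 5.20s versus zlib
level 1 at 6.85s makes PGLZ about 24% faster, while the toast relation
shrinks from 950,272 to 614,400 bytes, a 35% reduction.]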

>     If replacing the compressor/decompressor can cause a  runtime
>     difference  of  25%  in  such a scenario, the pure difference
>     between the two methods must be alot.
> 
>     Leave  PGLZ  in place as the default compressor for toastable
>     types.  Speed is what all benchmarks talk  about  -  on  disk
>     storage size is seldom a minor note.

True, disk storage is not the issue, but disk accesses are an issue.

>     We  can  discuss  about  enabling  zlib  as  a  per attribute
>     configurable alternative further. But is the  confusion  this
>     might cause worth it all?

I think we have to choose one proposal.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: lztext and compression ratios...

From
Tom Lane
Date:
JanWieck@t-online.de (Jan Wieck) writes:
>     Some quick numbers though:
>     I  simply  stripped  down pg_lzcompress.c to call compress2()
>     and uncompress() instead of doing  anything  itself  (what  a
>     nice,  small  source  file  :-).

I went at it in a different way: pulled out pg_lzcompress into a
standalone source program that could also call zlib.  These numbers
represent the pure compression or decompression time for memory-to-
memory processing, no other overhead at all.  Each run was iterated
1000 times to make it long enough to time accurately (so you can
read the times as "milliseconds per operation", though they're
really seconds).
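[The shape of such a harness, as an editorial sketch rather than Tom's
actual test program: load a file into memory, then time 1000 zlib
one-shot calls so the figure is pure memory-to-memory cost.]

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <zlib.h>

    static void
    time_zlib_compress(const Bytef *buf, uLong len, int level)
    {
        uLongf  bound = len + len / 1000 + 12;  /* worst-case output */
        Bytef  *out = malloc(bound);
        clock_t start = clock();

        for (int i = 0; i < 1000; i++)
        {
            uLongf n = bound;     /* compress2 rewrites its length arg */
            compress2(out, &n, buf, len, level);
        }
        printf("zlib(%d) x1000: %.2f u sec\n", level,
               (double) (clock() - start) / CLOCKS_PER_SEC);
        free(out);
    }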

A small text file of about 1K:

LZ compressed 990 bytes to 717 bytes in 0.35 u 0.00 sys sec
LZ decompressed 717 bytes to 990 bytes in 0.04 u 0.00 sys sec
zlib(3) compressed 990 bytes to 527 bytes in 0.73 u 0.00 sys sec
zlib uncompressed 527 bytes to 990 bytes in 0.21 u 0.00 sys sec
zlib(6) compressed 990 bytes to 513 bytes in 0.86 u 0.00 sys sec
zlib uncompressed 513 bytes to 990 bytes in 0.21 u 0.00 sys sec
zlib(9) compressed 990 bytes to 513 bytes in 0.86 u 0.00 sys sec
zlib uncompressed 513 bytes to 990 bytes in 0.20 u 0.00 sys sec

A larger text file, a bit under 100K:

LZ compressed 92343 bytes to 54397 bytes in 49.00 u 0.05 sys sec
LZ decompressed 54397 bytes to 92343 bytes in 3.74 u 0.00 sys sec
zlib(3) compressed 92343 bytes to 40524 bytes in 38.48 u 0.02 sys sec
zlib uncompressed 40524 bytes to 92343 bytes in 8.58 u 0.00 sys sec
zlib(6) compressed 92343 bytes to 38002 bytes in 68.04 u 0.03 sys sec
zlib uncompressed 38002 bytes to 92343 bytes in 8.13 u 0.00 sys sec
zlib(9) compressed 92343 bytes to 37961 bytes in 71.99 u 0.03 sys sec
zlib uncompressed 37961 bytes to 92343 bytes in 8.14 u 0.00 sys sec

It looks like the ultimate speed difference is roughly 2:1 on
decompression and less than that on compression, though zlib suffers
from greater setup overhead that makes it look worse on short strings.

Here is a much different example, a small JPEG image:

LZ compressed 5756 bytes to 5764 bytes in 1.48 u 0.00 sys sec
LZ decompressed 5764 bytes to 5756 bytes in 0.01 u 0.00 sys sec
zlib(3) compressed 5756 bytes to 5629 bytes in 3.18 u 0.00 sys sec
zlib uncompressed 5629 bytes to 5756 bytes in 0.75 u 0.00 sys sec
zlib(6) compressed 5756 bytes to 5625 bytes in 3.78 u 0.00 sys sec
zlib uncompressed 5625 bytes to 5756 bytes in 0.75 u 0.00 sys sec
zlib(9) compressed 5756 bytes to 5625 bytes in 3.78 u 0.00 sys sec
zlib uncompressed 5625 bytes to 5756 bytes in 0.75 u 0.00 sys sec

The PG-LZ code realizes it cannot compress this file and falls back to
straight bytewise copy, yielding very fast "decompress".  zlib manages
to eke out a few bytes' savings, so it goes ahead and compresses
instead of just storing literally, resulting in a big loss on
decompression speed for just a fractional space savings.  If we use
zlib, we'd probably be well advised to accept a compressed result only
if it's at least, say, 25% smaller than uncompressed to avoid this
low-return-on-investment situation.
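[That acceptance rule is a one-liner; an illustrative guard, not
committed code: keep the compressed image only when it is at least 25%
smaller, so nearly incompressible data stays stored as a plain copy.]

    /* true if compressed_size <= 75% of raw_size */
    static int
    worth_storing_compressed(unsigned long raw_size,
                             unsigned long compressed_size)
    {
        return compressed_size <= raw_size - raw_size / 4;
    }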

>     There might be some room for
>     improvement  using  static   zlib   stream   allocaions   and
>     deflateReset(),  inflateReset()  or  the  like.  But  I don't
>     expect a significant difference from that.

It turns out you can get a noticeable win from overriding zlib's
default memory allocator, which calls calloc() even though it has
no need for pre-zeroed storage.  Just making it use malloc() instead
saves about 25% of the runtime on 1K-sized strings (the above numbers
include this improvement).  We'd probably want to make it use palloc
instead of malloc anyway, so this won't cost us any extra work.  I
have not pushed on it to see what else might be gained by tweaking.
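[The override uses zlib's documented hooks: z_stream has zalloc, zfree,
and opaque fields the caller may fill in before deflateInit(). A sketch
with plain malloc standing in for palloc; the function names are
illustrative.]

    #include <stdlib.h>
    #include <zlib.h>

    static voidpf
    my_zalloc(voidpf opaque, uInt items, uInt size)
    {
        (void) opaque;
        /* plain malloc: zlib does not need pre-zeroed memory */
        return (voidpf) malloc((size_t) items * size);
    }

    static void
    my_zfree(voidpf opaque, voidpf address)
    {
        (void) opaque;
        free(address);
    }

    static int
    init_deflate(z_stream *zs, int level)
    {
        zs->zalloc = my_zalloc;   /* override the calloc() default */
        zs->zfree = my_zfree;
        zs->opaque = NULL;
        return deflateInit(zs, level);
    }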

>     What I suggest:

>     Leave  PGLZ  in place as the default compressor for toastable
>     types.  Speed is what all benchmarks talk  about  -  on  disk
>     storage size is seldom a minor note.

True, but the other side of the coin is that better compression and
smaller disk space will mean less I/O, better use of shared-memory
disk buffers, etc.  So to some extent, better compression helps pay
for itself.

>     Fix its history allocation for huge values and have someone
>     (PgSQL Inc.?) patent the compression algorithm, so we're
>     safe at some point in the future.

That would be a really *bad* idea.  What will people say if we say
"Postgres contains patented algorithms, but we'll let you use them
for free" ?  They'll say "no thanks, I remember Unisys' repeatedly
broken promises about the GIF patent" and stay away in droves.
There is a *lot* of bad blood in the air about compression patents
of any sort.  We mustn't risk tainting Postgres' reputation with
that mess.

(In any case, one would hope you couldn't get a patent on this
method, though I suppose it never pays to overestimate the competence
of the USPTO...)

>     If there's a patent problem
>     in it, we are already running the risk of getting sued; the PGLZ
>     code got shipped with 7.0, used in lztext.

But it hasn't been documented or advertised.  If we take it out
again in 7.1, I think our exposure to potential lawsuits from it is
negligible.  Not that I think there is any big risk there anyway,
but we ought to consider the possibility.

My feeling is that going with zlib is probably the right choice.
The technical case for using a homebrew compressor instead isn't
very compelling, and the advantages of using a standardized,
known-patent-free library are not to be ignored.
        regards, tom lane


Re: lztext and compression ratios...

From
Hannu Krosing
Date:
Tom Lane wrote:
> 
> JanWieck@t-online.de (Jan Wieck) writes:
> >     Some quick numbers though:
> >     I  simply  stripped  down pg_lzcompress.c to call compress2()
> >     and uncompress() instead of doing  anything  itself  (what  a
> >     nice,  small  source  file  :-).
> 
> I went at it in a different way: pulled out pg_lzcompress into a
> standalone source program that could also call zlib.  These numbers
> represent the pure compression or decompression time for memory-to-
> memory processing, no other overhead at all.  Each run was iterated
> 1000 times to make it long enough to time accurately (so you can
> read the times as "milliseconds per operation", though they're
> really seconds).

We could just make this part extensible as well, like the rest of 
postgres, so we would have a directory tree like 

/compressors
    /nullcompressor
    /lzcompress
    /zlib
    /lzo
    /bzip2
    /my_new_supercompressor
    /classic_huffman_for_uppercase_american_english

and select the desired compressor at compile time, or even better, on 
field by field basis at runtime, so that field that stores mainly 
tar.gz-s at compression level 9 will use nullcompressor, and others 
will use what is best for them.

> 
> >     Fix its history allocation for huge values and have someone
> >     (PgSQL Inc.?) patent the compression algorithm, so we're
> >     safe at some point in the future.
> 
> That would be a really *bad* idea.  What will people say if we say
> "Postgres contains patented algorithms, but we'll let you use them
> for free" ?  They'll say "no thanks, I remember Unisys' repeatedly
> broken promises about the GIF patent" and stay away in droves.
> There is a *lot* of bad blood in the air about compression patents
> of any sort.  We mustn't risk tainting Postgres' reputation with
> that mess.
> (In any case, one would hope you couldn't get a patent on this
> method, though I suppose it never pays to overestimate the competence
> of the USPTO...)

And AFAIK (IANAL ;) you can only patent previously _unpublished_ work,
even if the prior publication was by the patent applicant.

> 
> >     If there's a patent problem
> >     in it, we are already running the risk of getting sued; the PGLZ
> >     code got shipped with 7.0, used in lztext.
> 
> But it hasn't been documented or advertised.  If we take it out
> again in 7.1, I think our exposure to potential lawsuits from it is
> negligible.  Not that I think there is any big risk there anyway,
> but we ought to consider the possibility.
> 
> My feeling is that going with zlib is probably the right choice.
> The technical case for using a homebrew compressor instead isn't
> very compelling,

Speed seems to be a good reason, if we can keep it up.

> and the advantages of using a standardized,
> known-patent-free library are not to be ignored.

OTOH, there are possibly patents on other part of postgres, 
like indexing, storage methods, the mere fact that something is 
stored in another relation, using 'Z' as a protocol character, etc. 
etc. So using a patent-free compression library does not help much.

So if PgSQL Inc. has lots of lawyers with nothing to do, they could do
some patent research and scare all developers away with their findings
;)

-------------
Hannu