Re: lztext and compression ratios... - Mailing list pgsql-general

From Jeffery Collins
Subject Re: lztext and compression ratios...
Date
Msg-id 3963CDDA.43643AF7@onyx-technologies.com
Whole thread Raw
In response to Re: lztext and compression ratios...  (JanWieck@t-online.de (Jan Wieck))
List pgsql-general
Jan,

Thank you for your comments on lztext.  They were very informative.  I hope
you (and the rest of the list) don't mind my ramblings on compression, but I
think there is some performance/space advantage if compression can integrated
into a table.  I admit to not being a compression expert.  All that I know
has been learn from reading some zlib source.

The compression that I have been working with (and am more familiar with) is
Huffman compression.  You are probably familiar with it and may have even
evaluated it when you were looking at compression techniques.  In my limited
understanding of LZ compression, there are some advantages and disadvantages
to Huffman compression.

The disadvantages seem to be:
    - In normal Huffman compression, the string to be compressed is examined
twice.  Once to build the translation table and once to do the translation.
    - The translation table must be written with the string.  This makes the
result larger and limits it's effectiveness with small strings.
    - I don't know for sure, but I wouldn't be surprised if uncompression is
slower with Huffman than LZ compression.

To get around these disadvantages, I modified my Huffman support in the
following way:
    - I trained the translation table and stored the table separately.  I was
able to do this because I knew before hand the type of text that would be in
my columns.
    - I "improved" the compression by giving the translation table the
ability to look for well known substrings in addition to byte values.  For
example the string "THE" in an English text might have it's own Huffman
translation value instead of relying on the individual T, H and E
translations.  Again I was able to do this because I knew the type of text I
would be storing in the column.

Because the translation table is external to the string, it is no longer
included in the string.  This improves the compression ratio and helps the
uncompression as the table does not need to be read and interpreted.  This
approach also allows for and takes advantage of cross-row commonalities.
After doing all of this, I was able to get the average compression ratio to
about 3.5 (i.e. orig size / new size) on text columns of 1K or less.

Of course the disadvantage to this is that if the dataset changes overtime,
the compression translations to not change with it and the compression ratio
will get worse.  In the very worst case, where the dataset totally changes
(say the language changed from English to Russian) the "compressed" string
could actually get larger than the original string because everything the
compression algorithm thought it knew was totally wrong.

As I said, I did this with a different database (CTREE).  My company is now
converting from CTREE to postgresql and I would like to find some fairly
elegant way to include this type of compression support.  My CTREE
implementation was rather a hack, but now that we are moving to a better
database, I would like to have a better solution for compression.

Originally I latched onto the idea of using lztext or something like lztext,
but I agree it is better if the column type is something standard and the
compression is hidden by the backend.  I guess my vision would be to be able
to define either a column type or a column attribute that would allow me to
indicate the type of compression to have and possibly the compression
algorithm and translation set to use.  Actually the real vision is to have it
all hidden and have the database learn about the column as rows are added and
modify the compression transalation to adapt to the changing column data.

Jeff




pgsql-general by date:

Previous
From: Jim Wise
Date:
Subject: Re: [HACKERS] Re: Revised Copyright: is this more palatable?
Next
From: Renato Moutinho
Date:
Subject: Two questions in a row