Re: lztext and compression ratios... - Mailing list pgsql-general

From JanWieck@t-online.de (Jan Wieck)
Subject Re: lztext and compression ratios...
Date 2000-07-05 22:16
Msg-id 200007052216.AAA12189@hot.jw.home
In response to lztext and compression ratios...  (Jeffery Collins <collins@onyx-technologies.com>)
Responses Re: lztext and compression ratios...  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
Jeffery Collins wrote:
> I have been looking at using the lztext type and I have some
> questions/observations.   Most of my experience comes from attempting to
> compress text records in a different database (CTREE), but I think the
> experience is transferable.

    First of all, I welcome any suggestions/input on the
    compression code I've implemented. You seem to be experienced
    in this area, so keep going and hammer away at any of my
    points below.

    You should know that the "lztext" type will disappear soon
    and be replaced by the general "all varlena types are
    potentially compressible" approach of TOAST.

> My typical table consists of variable length text records.  The average
> length record is around 1K bytes.  I would like to compress my records
> to save space and improve I/O performance (smaller records means more
> records fit into the file system cache which means less I/O - or so the
> theory goes).  I am not too concerned about CPU as we are using a 4-way
> Sun Enterprise class server.  So compression seems like a good idea to me.
>
> My experience with attempting to compress such a relatively small
> (around 1K) text string is that the compression ratio is not very
> good.  This is because the string is not long enough for the LZ
> compression algorithm to establish really good compression patterns and
> the fact that the de-compression table has to be built into each
> record.  What I have done in the past to get around these problems is
> that I have "taught" the compression algorithm the patterns ahead of
> time and stored the de-compression patterns in an external table.  Using
> this technique, I have achieved *much* better compression ratios.

    The compression algorithm used in "lztext" (and so  in  TOAST
    in  the  future)  doesn't have a de-compression table at all.
    It's  based  on  Adisak  Pochanayon's  SLZ  algorithm,  using
    <literal_char>  or  <token>.  A <token> just tells how far to
    go back in the OUTPUT-buffer and how many bytes to copy  from
    OUTPUT to OUTPUT. Look at the code for details.
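
    For illustration only, here is a minimal sketch of such a
    decompression loop. The control-byte layout is made up for
    the example - the real format in pg_lzcompress.c differs in
    detail, so read the code there for the truth:

        #include <stddef.h>

        /*
         * Sketch of an SLZ/LZ77-style decompressor (NOT the real
         * pg_lzcompress.c format).  A control byte carries 8 flag
         * bits.  A clear bit means "one literal byte follows", a
         * set bit means "a 2-byte <token> follows" encoding how
         * far to go back in the OUTPUT buffer (12 bits) and how
         * many bytes to copy from OUTPUT to OUTPUT (4 bits, plus
         * a minimum match length of 3).  Assumes well-formed
         * input.
         */
        size_t
        lz_sketch_decompress(const unsigned char *src, size_t srclen,
                             unsigned char *dst)
        {
            const unsigned char *send = src + srclen;
            unsigned char *dp = dst;

            while (src < send)
            {
                int ctrl = *src++;
                int bit;

                for (bit = 0; bit < 8 && src < send; bit++, ctrl >>= 1)
                {
                    if (ctrl & 1)
                    {
                        /* <token>: copy len bytes from dp - off */
                        int off = ((src[0] & 0x0f) << 8) | src[1];
                        int len = (src[0] >> 4) + 3;
                        unsigned char *cp = dp - off;

                        src += 2;
                        while (len-- > 0)
                            *dp++ = *cp++;  /* byte-wise, may overlap */
                    }
                    else
                        *dp++ = *src++;     /* <literal_char> */
                }
            }
            return dp - dst;    /* decompressed size */
        }

    Note the byte-by-byte copy: a <token> may reach into bytes it
    is producing itself, which is how long runs and repeated
    patterns compress without any table.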

    My  design  rules for the compression inside of Postgres have
    been

        -   being fast on decompression
        -   being good for relatively small values
        -   being fast on compression

    The first rule is met by the implementation itself. Don't
    underestimate this design rule! Usually you don't update as
    often as you query. And the implementation of TOAST requires
    super-fast decompression.

    The   second   rule   is  met  by  not  needing  any  initial
    decompression table inside of the stored value.

    The third rule is controlled by the default strategy of the
    algorithm, (unfortunately) hardcoded into
    utils/adt/pg_lzcompress.c. It'll never try to compress items
    smaller than 256 bytes. It'll fall back to plain storage (for
    a speed advantage while decompressing the value) if less than
    20% compression is gained. And it'll stop the match lookup as
    soon as a backward match of 128 or more bytes is found.
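
    Boiled down, that default strategy amounts to three knobs like
    the following. The names are made up for illustration - the
    actual structure in pg_lzcompress.c is shaped differently:

        /* Illustrative only - not the real strategy structure. */
        typedef struct
        {
            int min_input_size;     /* don't even try below this    */
            int min_savings_pct;    /* store plain below this gain  */
            int good_match;         /* stop the match lookup once a
                                     * match this long is found     */
        } CompressStrategy;

        static const CompressStrategy default_strategy = {
            256,    /* items < 256 bytes stay uncompressed          */
            20,     /* < 20% gained -> fall back to plain storage   */
            128     /* a match of >= 128 bytes is "good enough"     */
        };

    A value is only ever stored compressed if it clears the first
    two hurdles; the third one merely caps the CPU time spent
    hunting for better matches.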

> So my questions/comments are:
>
>     - What are the typical compression ratios on relatively small (i.e.
> around 1K) strings seen with lztext?

    I don't have items that small handy. But here is a table
    containing a file path and its content, where all files were
    HTML files. length() is the raw size, octet_length() the
    compressed size actually stored:

        length (bytes) | count(*) | avg(length) | avg(octet_length)
        ---------------+----------+-------------+------------------
        1024 - 2047    |       14 |        1905 |              1470
        2048 - 4095    |       67 |        3059 |              1412
        4096 - 8191    |       45 |        5384 |              2412
        8192 -         |       25 |       17200 |              6323
        ---------------+----------+-------------+------------------
        all            |      151 |        5986 |              2529

    Over the whole table, the stored size thus averages roughly
    42% of the raw size.

>     - Does anyone see a need/use for a generalized string compression
> type that can be "trained" external to the individual records?

    Yes, of course. Maybe "lztext" can be a framework for you, and
    we just tell the toaster "never apply your lousy compression
    to that column" (the code is prepared for that).
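
    For what it's worth, the "trained" approach you describe is
    what zlib calls a preset dictionary: the common patterns live
    outside the stored value and are fed to the compressor and the
    decompressor up front. A minimal sketch in plain zlib terms -
    nothing of this exists in Postgres:

        #include <string.h>
        #include <zlib.h>

        /*
         * Compress src using a preset dictionary of known
         * patterns.  The same dictionary must be handed to the
         * decompressor again, so it lives outside the stored
         * value - exactly the external de-compression table idea.
         * Returns 0 on success.
         */
        static int
        compress_with_dict(const unsigned char *dict, unsigned dictlen,
                           const unsigned char *src, unsigned srclen,
                           unsigned char *dst, unsigned *dstlen)
        {
            z_stream zs;

            memset(&zs, 0, sizeof(zs));
            if (deflateInit(&zs, Z_BEST_COMPRESSION) != Z_OK)
                return -1;

            /* "teach" the compressor the patterns ahead of time */
            deflateSetDictionary(&zs, dict, dictlen);

            zs.next_in = (Bytef *) src;
            zs.avail_in = srclen;
            zs.next_out = dst;
            zs.avail_out = *dstlen;

            if (deflate(&zs, Z_FINISH) != Z_STREAM_END)
            {
                deflateEnd(&zs);
                return -1;
            }
            *dstlen = zs.total_out;
            return deflateEnd(&zs) == Z_OK ? 0 : -1;
        }

    On the inflate side, inflate() reports Z_NEED_DICT and you
    answer with inflateSetDictionary() using the same bytes. Mind
    rule one above, though - zlib buys its ratio at the price of
    much slower decompression than the scheme described here.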

>     - Am I crazy in even attempting to compress strings of this relative
> size?  My largest table currently contains about 2 million entries of
> roughly 1k size strings or about 2Gig of data.  If I could compress this
> to about 33% of its original size (not unreasonable with a trained LZ
> compression), I would save a lot of disk space (not really important)
> and a lot of file system cache space (very important) and be able to fit
> the entire table into memory (very, very important).

    No one is crazy for attempting to improve something. It might
    turn out not to work well, or to have been brain-damaged from
    the start. But someone who never tries will miss all his
    chances.

    Final note:

    One thing to keep in mind is that the LZ algorithm you're
    thinking of must be distributable under the terms of the BSD
    license. If it's copyrighted or patented by a third party that
    doesn't agree to these terms, it's out of the question and
    will never appear in the Postgres source tree. Especially the
    LZW algorithm used in GIF is one of these show-stoppers.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #


