Re: [HACKERS] compression in LO and other fields - Mailing list pgsql-hackers

From wieck@debis.com (Jan Wieck)
Subject Re: [HACKERS] compression in LO and other fields
Date
Msg-id m11mHDk-0003kLC@orion.SAPserv.Hamburg.dsh.de
In response to Re: [HACKERS] compression in LO and other fields  (Karel Zak - Zakkr <zakkr@zf.jcu.cz>)
Responses Re: [HACKERS] compression in LO and other fields  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: [HACKERS] compression in LO and other fields  (The Hermit Hacker <scrappy@hub.org>)
Re: [HACKERS] compression in LO and other fields  (Karel Zak - Zakkr <zakkr@zf.jcu.cz>)
List pgsql-hackers
Karel Zak - Zakkr wrote:

> On Fri, 12 Nov 1999, Jan Wieck wrote:
>
> >     I  already  made  some  tests  with  a type I called 'lztext'
> >     locally.  Only the input-/output-functions exist so  far  and
>
> Is it your original implementation, or do you use existing compression
> code? I tried bzip2, but the output of that algorithm is entirely binary,
> and I don't know how to use it in PgSQL if all the backend (in/out)
> routines use char * (yes, I'm a newbie at PgSQL hacking :-).

    The  internal  storage  format is based on an article I found
    at:

        http://www.neutralzone.org/home/faqsys/docs/slz_art.txt

        Simple Compression using an LZ buffer
        Part 3 Revision 1.d:
        An introduction to compression on the Amiga by Adisak Pochanayon

        Freely Distributable as long as reproduced completely.
        Copyright 1993 Adisak Pochanayon

    I've written the code from scratch.

    The internal representation  is  binary,  for  sure.  It's  a
    PostgreSQL variable length data format as usual.

    I don't know if there's a compression library available that
    fits our needs. First and most important, it must have a
    license that permits us to include it in the distribution
    under our existing license. Second, its implementation must
    not cause any problems in the backend, such as memory leaks.

> >     The compression rates aren't that gigantic.  I've  got  30-50%
>
> Isn't it a problem that your implementation compresses all the data at
> once? Typically compression works on a stream and compresses only a small
> buffer in each cycle.

    No, that's no problem. On type input, the original  value  is
    completely  in  memory  given  as  a  char*, and the internal
    representation is returned as a palloc()'d Datum. For  output
    it's vice versa.
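
    As a minimal illustration of compressing a value that is
    entirely in memory, without any streaming, here is a toy
    LZ-style sketch. It is not the lztext code and does not
    reproduce Adisak's storage format; the token layout, the
    constants and the function names are all assumptions made
    up for this example.

```python
# Toy whole-buffer LZ sketch. Hypothetical token layout (an assumption,
# not the lztext format):
#   literal: 0x00, byte
#   match:   0x01, offset high byte, offset low byte, length
MIN_MATCH = 3      # shorter matches are not worth a 4-byte token
MAX_MATCH = 255    # match length must fit in one byte
WINDOW = 4095      # how far back the search for matches goes


def lz_compress(data: bytes) -> bytes:
    """Compress the complete input in one call (brute-force search)."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        best_len, best_off = 0, 0
        for j in range(max(0, i - WINDOW), i):
            length = 0
            # A match may run past position i (overlap); that is fine
            # because the decompressor copies one byte at a time.
            while (length < MAX_MATCH and i + length < n
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        if best_len >= MIN_MATCH:
            out += bytes((1, best_off >> 8, best_off & 0xFF, best_len))
            i += best_len
        else:
            out += bytes((0, data[i]))
            i += 1
    return bytes(out)


def lz_decompress(blob: bytes) -> bytes:
    """Inverse of lz_compress; also works on the complete value at once."""
    out = bytearray()
    i = 0
    while i < len(blob):
        if blob[i] == 0:
            out.append(blob[i + 1])
            i += 2
        else:
            offset = (blob[i + 1] << 8) | blob[i + 2]
            length = blob[i + 3]
            for _ in range(length):
                out.append(out[-offset])  # byte-wise copy handles overlap
            i += 4
    return bytes(out)
```

    Calling lz_decompress(lz_compress(value)) returns the original
    bytes. The sketch spends two bytes per literal, so input with
    few repetitions grows; a real format would pack the
    literal/match flags more tightly.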

    O.K.  some  details  on  the  compression rate. I've used 112
    .html files with a total size of  1188346  bytes  this  time.
    The  smallest one was 131 bytes, the largest one 114549 bytes
    and most of the files are somewhere between 3-12K.

    Compression results on the binary level are:

        gzip -9 outputs 398180 bytes (66.5% rate)

        gzip -1 outputs 447597 bytes (62.3% rate)

        my code outputs 529420 bytes (55.4% rate)

    HTML input might be somewhat optimal for Adisak's storage
    format, but taking into account that my source implementing
    the type input and output functions is smaller than 600
    lines, I think a difference of 11 percentage points from
    gzip -9 is a good result anyway.
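
    The rates quoted above are the space saved relative to the
    1188346 input bytes. A short sketch to recompute them from
    the byte counts given in this mail (the function name is
    hypothetical):

```python
def compression_rate(original_size: int, compressed_size: int) -> float:
    """Percentage of space saved relative to the original size."""
    return (1.0 - compressed_size / original_size) * 100.0


TOTAL_INPUT = 1188346  # total size of the 112 .html files, in bytes

# Output sizes in bytes, as reported in the mail.
outputs = {"gzip -9": 398180, "gzip -1": 447597, "my code": 529420}

for name, size in outputs.items():
    print(f"{name}: {compression_rate(TOTAL_INPUT, size):.1f}% rate")
```

    Rounded to one decimal place this gives 66.5, 62.3 and 55.4,
    matching the figures listed.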

> >     Sorry  for the compression specific slang here.  Well, anyone
> >     interested in the code?
>
> Yes, I am - I am finishing the to_char()/to_date() Oracle-compatible
> routines (Thomas, are you still quiet?) and this looks like an appealing
> new task for me :-)

    Bruce suggested the contrib area, but I'm not sure if  that's
    the right place. If it goes into the distribution at all, I'd
    like to use this data type for rule plan strings and function
    source text in the system catalogs. I don't expect we'll have
    a general solution for tuples split  across  multiple  blocks
    for v7.0. And using lztext for rules and function sources
    would lower some FRP's. But using it in the catalogs requires
    it to be built in.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #
