Re: [HACKERS] compression in LO and other fields - Mailing list pgsql-hackers

From wieck@debis.com (Jan Wieck)
Subject Re: [HACKERS] compression in LO and other fields
Msg-id m11m7SM-0003kLC@orion.SAPserv.Hamburg.dsh.de
In response to Re: [HACKERS] compression in LO and other fields  (Tatsuo Ishii <t-ishii@sra.co.jp>)
Responses Re: [HACKERS] compression in LO and other fields  (Bruce Momjian <maillist@candle.pha.pa.us>)
Re: [HACKERS] compression in LO and other fields  (Karel Zak - Zakkr <zakkr@zf.jcu.cz>)
List pgsql-hackers
Tatsuo Ishii wrote:

> > LO is a dead end.  What we really want to do is eliminate tuple-size
> > restrictions and then have large ordinary fields (probably of type
> > bytea) in regular tuples.  I'd suggest working on compression in that
> > context, say as a new data type called "bytez" or something like that.
>
> It sounds ideal but I remember that Vadim said inserting a 2GB record
> is not good idea since it will be written into the log too. If it's a
> necessary limitation from the point of view of WAL, we have to accept
> it, I think.

    Just in case someone wants to implement a complete compressed
    data type (including comparison functions, operators and a
    default operator class for indexing):

    I have already made some tests with a type I called 'lztext'
    locally. Only the input/output functions exist so far, and as
    the name suggests, it would be an alternative to 'text'. It
    uses a simple but fast, byte-oriented LZ method with backward
    pointers; no Huffman coding or variable offset/size tagging.
    The first byte of a chunk tells, bit by bit, whether each of
    the following 8 items is a raw byte to copy or a 12-bit
    offset, 4-bit size copy instruction. That gives a maximum
    back offset of 4096 and a maximum match size of 17 bytes.

    What made it my preferred method is the fact that
    decompression works entirely from the already decompressed
    portion of the data, so it needs no code tables or the like
    at decompression time.
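    To make the chunk format above concrete, here is a minimal
    decoder sketch in C. This is NOT the actual lztext code; the
    function name, the bit order of the tag byte (LSB first, with
    a set bit marking a copy item) and the exact layout and biases
    of the two-byte copy item (offset stored minus 1 in the upper
    12 bits, size stored minus 2 in the lower 4 bits, which yields
    the 4096 / 17 limits) are assumptions made for illustration
    only.

    #include <stddef.h>
    #include <stdint.h>

    /*
     * Illustrative decoder for the scheme described above.
     * Assumes well-formed input; src/srclen is the compressed
     * data, dst must be large enough for the uncompressed result.
     * Returns the number of bytes written to dst.
     */
    static size_t
    lz_decompress_sketch(const uint8_t *src, size_t srclen, uint8_t *dst)
    {
        const uint8_t *sp = src;
        const uint8_t *send = src + srclen;
        uint8_t       *dp = dst;

        while (sp < send)
        {
            uint8_t ctrl = *sp++;           /* one tag bit per item */
            int     bit;

            for (bit = 0; bit < 8 && sp < send; bit++, ctrl >>= 1)
            {
                if (ctrl & 1)
                {
                    /* copy item: 12-bit back offset, 4-bit size */
                    int off = ((sp[0] << 4) | (sp[1] >> 4)) + 1;  /* 1..4096 (assumed +1 bias) */
                    int len = (sp[1] & 0x0f) + 2;                 /* 2..17  (assumed +2 bias) */
                    const uint8_t *cp = dp - off;

                    sp += 2;
                    /* Copy from the already decompressed output; an
                     * overlapping copy intentionally replicates short
                     * patterns, so copy byte by byte. */
                    while (len-- > 0)
                        *dp++ = *cp++;
                }
                else
                {
                    /* literal item: one raw byte */
                    *dp++ = *sp++;
                }
            }
        }
        return (size_t) (dp - dst);
    }

    Note how the copy branch reads only from the output written so
    far (dp - off), which is exactly why no dictionary or code
    table is needed at decompression time.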

    It is really FASTEST on decompression, which I assume would
    be the most frequent operation on huge data types. With some
    care, comparison could be done on the fly while decompressing
    two values, so that the entire comparison can be aborted at
    the first difference.

    The compression rates aren't that gigantic. I've got 30-50%
    for rule plan strings (size limit on views!!!). And the
    method only allows buffer back references of at most 4K
    offsets, so the rate will not grow for larger data chunks.
    That's a heavy tradeoff between compression rate on one side
    and guaranteed freedom from memory leakage plus speed on the
    other, I know, but I prefer not to force it; instead I
    usually use a bigger hammer (the tuple size limit is still
    our original problem, and another IBM 72GB disk doing
    22-37 MB/s will make any compressing data type obsolete
    anyway).

    Sorry for the compression-specific jargon here. Well, is
    anyone interested in the code?


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #
