Re: [HACKERS] compression in LO and other fields - Mailing list pgsql-hackers

From wieck@debis.com (Jan Wieck)
Subject Re: [HACKERS] compression in LO and other fields
Date
Msg-id m11mIp4-0003kLC@orion.SAPserv.Hamburg.dsh.de
In response to Re: [HACKERS] compression in LO and other fields  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [HACKERS] compression in LO and other fields  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Tom Lane wrote:

> wieck@debis.com (Jan Wieck) writes:
>
> >     But it requires decompression of every tuple into
> >     palloc()'d memory during heap access. AFAIK, the heap
> >     access routines currently return a pointer to the tuple
> >     inside the shm buffer. Don't know what its performance
> >     impact would be.
>
> Good point, but the same will be needed when a tuple is split across
> multiple blocks.  I would expect that (given a reasonably fast
> decompressor) there will be a net performance *gain* due to having
> less disk I/O to do.  Also, this won't be happening for "every" tuple,
> just those exceeding a size threshold --- we'd be able to tune the
> threshold value to trade off speed and space.

    Right, this time it's your good point. All of the same
    problems will arise when tuple splitting across multiple
    blocks is implemented.

    The major problem I see is that a palloc()'d tuple must be
    pfree()'d after the fetcher is done with it. Since tuples
    currently live in the shared memory buffer, the fetcher
    doesn't have to care about freeing them.
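
    To illustrate, here is a minimal standalone sketch of that
    ownership problem (all names are made up, and malloc/free
    stand in for palloc/pfree - not the actual heap access code):

    #include <stdlib.h>
    #include <string.h>

    /* stand-in for the real decompressor (just copies bytes here) */
    static void
    decompress(const char *src, size_t srclen, char *dst, size_t rawlen)
    {
        memcpy(dst, src, srclen < rawlen ? srclen : rawlen);
    }

    typedef struct FetchedTuple
    {
        char *data;       /* the tuple data, possibly decompressed */
        int   needs_free; /* nonzero if data was allocated here */
    } FetchedTuple;

    static FetchedTuple
    fetch_tuple(char *buf, size_t len, int is_compressed, size_t rawlen)
    {
        FetchedTuple ft;

        if (!is_compressed)
        {
            /* today's behaviour: a pointer into the shm buffer */
            ft.data = buf;
            ft.needs_free = 0;
        }
        else
        {
            /* decompress into freshly allocated working space */
            ft.data = malloc(rawlen);
            decompress(buf, len, ft.data, rawlen);
            ft.needs_free = 1;
        }
        return ft;
    }

    static void
    release_tuple(FetchedTuple *ft)
    {
        /* the caller must not forget this in the decompressed case */
        if (ft->needs_free)
            free(ft->data);
    }

    Every fetcher would have to track something like needs_free -
    which is exactly the care it doesn't have to take today.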

> One thing that does occur to me is that we need to store the
> uncompressed as well as the compressed data size, so that the
> working space can be palloc'd before starting the decompression.

    Yep - and I'm doing so. Only during compression is the result
    size not known in advance. But there is a well-known maximum:
    the header overhead plus the data size times 1.125 plus 2
    bytes (the total worst case, on incompressible data). And a
    general mechanism working on the tuple level would fall back
    to storing the data uncompressed whenever the compressed
    result comes out bigger.
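
    As a rough sketch (standalone, names made up - not the actual
    implementation), the stored sizes, the worst-case bound and
    the fallback fit together like this:

    #include <stdlib.h>
    #include <string.h>

    typedef struct CompressedHeader
    {
        size_t rawsize;    /* uncompressed size, stored so the reader
                            * can palloc() the working space before
                            * decompression starts */
        size_t storedsize; /* size of the bytes actually stored */
    } CompressedHeader;

    /* worst case on incompressible data: size * 1.125 + 2 bytes,
     * plus the header overhead */
    static size_t
    worst_case_size(size_t rawsize)
    {
        return sizeof(CompressedHeader) + rawsize + rawsize / 8 + 2;
    }

    /* stand-in for the real compressor; returns compressed length */
    static size_t
    compress_data(const char *src, size_t srclen, char *dst)
    {
        memcpy(dst, src, srclen);  /* pretend nothing compressed */
        return srclen;
    }

    static char *
    store_datum(const char *src, size_t srclen)
    {
        char             *buf = malloc(worst_case_size(srclen));
        CompressedHeader *hdr = (CompressedHeader *) buf;
        char             *payload = buf + sizeof(CompressedHeader);
        size_t            clen = compress_data(src, srclen, payload);

        hdr->rawsize = srclen;
        if (clen >= srclen)
        {
            /* fallback: compression didn't pay off, store raw bytes */
            memcpy(payload, src, srclen);
            clen = srclen;
        }
        hdr->storedsize = clen;
        return buf;
    }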

> Also, in case it wasn't clear, I was envisioning leaving the tuple
> header uncompressed, so that time quals etc can be checked before
> decompressing the tuple data.

    Of course.

    Well, you asked for the rates on the smaller HTML files only.
    78 files, 131 bytes min, 10000 bytes max, 4582 bytes avg,
    357383 bytes total.

    gzip -9 outputs 145659 bytes (59.2% saved)
    gzip -1 outputs 155113 bytes (56.6% saved)
    my code outputs 184109 bytes (48.5% saved)

    67 files, 2000 bytes min, 10000 bytes max, 5239 bytes avg,
    351006 bytes total.

    gzip -9 outputs 141772 bytes (59.6% saved)
    gzip -1 outputs 151150 bytes (56.9% saved)
    my code outputs 179428 bytes (48.9% saved)

    The threshold will surely be a tuning parameter of interest.
    Another tuning option should be to allow or deny compression
    per table entirely. Then we could have both options: a
    compressing field type defining which portion of a tuple to
    compress, or compressing entire tuples. The decision could
    boil down to something like the sketch below.
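
    Again just a sketch with made-up names - a per-table switch
    plus a size threshold:

    typedef struct TableCompressOpts
    {
        int allow_compression; /* per-table on/off switch */
        int threshold;         /* minimum tuple size worth trying */
    } TableCompressOpts;

    static int
    should_compress(const TableCompressOpts *opts, int tuplesize)
    {
        return opts->allow_compression && tuplesize >= opts->threshold;
    }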


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #
