Re: [HACKERS] compression in LO and other fields - Mailing list pgsql-hackers

From: Tom Lane
Subject: Re: [HACKERS] compression in LO and other fields
Msg-id: 26512.942417843@sss.pgh.pa.us
In response to: Re: [HACKERS] compression in LO and other fields  (wieck@debis.com (Jan Wieck))
Responses: Re: [HACKERS] compression in LO and other fields  (wieck@debis.com (Jan Wieck))
           Re: [HACKERS] compression in LO and other fields  (Bruce Momjian <maillist@candle.pha.pa.us>)
List: pgsql-hackers
wieck@debis.com (Jan Wieck) writes:
>     Html input might be somewhat  optimal  for  Adisak's  storage
>     format,  but  taking into account that my source implementing
>     the type input and  output  functions  is  smaller  than  600
>     lines,  I  think 11% difference to a gzip -9 is a good result
>     anyway.

These strike me as very good results.  I'm not at all sure that using
gzip or bzip would give much better results in practice in Postgres,
because those compressors are optimized for relatively large files,
whereas a compressed-field datatype would likely be getting relatively
small field values to work on.  (So your test data set is probably a
good one for our purposes --- do the numbers change if you exclude
all the files over, say, 10K?)
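
To put a rough number on that intuition, here's a quick zlib toy (my own
illustration, nothing that would go into the backend as-is) that compresses
the same data once as a single 100K blob and once as a hundred independent
1K "field values".  The separate pieces come out noticeably larger in total,
because each one pays the stream overhead and starts with an empty history
window; compress2() at level 9 is close enough to "gzip -9" for this
comparison.

#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

#define PIECE   1024
#define PIECES  100

int
main(void)
{
    static Bytef src[PIECE * PIECES];
    Bytef      *zbuf;
    uLongf      zlen;
    uLong       total = 0;
    int         i;

    /* mildly repetitive text-ish data, so there is something to squeeze */
    for (i = 0; i < PIECE * PIECES; i++)
        src[i] = "the quick brown fox "[i % 20];

    zbuf = malloc(compressBound(PIECE * PIECES));
    if (zbuf == NULL)
        return 1;

    /* whole thing at once: the compressor gets to use all its history */
    zlen = compressBound(PIECE * PIECES);
    compress2(zbuf, &zlen, src, PIECE * PIECES, 9);
    printf("one %d-byte blob          -> %lu bytes\n",
           PIECE * PIECES, (unsigned long) zlen);

    /* same bytes as 100 independent "field values" */
    for (i = 0; i < PIECES; i++)
    {
        zlen = compressBound(PIECE);
        compress2(zbuf, &zlen, src + i * PIECE, PIECE, 9);
        total += zlen;
    }
    printf("%d separate %d-byte fields -> %lu bytes total\n",
           PIECES, PIECE, (unsigned long) total);

    free(zbuf);
    return 0;
}

(Build with "cc demo.c -lz".)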

>     Bruce suggested the contrib area, but I'm not sure if  that's
>     the right place. If it goes into the distribution at all, I'd
>     like to use this data type for rule plan strings and function
>     source text in the system catalogs.

Right, if we are going to bother with it at all, we should put it
into the core so that we can use it for rule plans.

>     I don't expect we'll have
>     a general solution for tuples split  across  multiple  blocks
>     for  v7.0.

I haven't given up hope of that yet --- but even if we do, compressing
the data is an attractive choice to reduce the frequency with which
tuples must be split across blocks.


It occurred to me last night that applying compression to individual
fields might not be the best approach.  Certainly a "bytez" data type
is the easiest thing to fit into the existing system, but it's leaving
some space savings on the table.  What about compressing the *whole*
data contents of a tuple on-disk, as a single entity?  That should save
more space than field-by-field compression.  It could be triggered in
the tuple storage routines whenever the uncompressed size exceeds some
threshold.  (We'd need a flag in the tuple header to indicate compressed
data, but I think there are bits to spare.)  When we get around to
having split tuples, the code would still be useful because it'd be
applied as a first resort before splitting a large tuple; it'd reduce
the frequency of splits and the number of sections big tuples get split
into.  All automatic and transparent, too --- the user doesn't have to
change data declarations at all.
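
Just to make that concrete, here's a standalone sketch of the decision the
tuple storage routine would make --- invented names, zlib standing in for
whatever compressor we settle on, nothing here is actual backend code:
compress the whole data area only when it exceeds a threshold and actually
shrinks, and remember that fact in what would really be a spare header bit.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

#define COMPRESS_THRESHOLD 512      /* arbitrary cutoff, in bytes */

typedef struct StoredTuple
{
    int     is_compressed;          /* stands in for a spare header bit */
    uLong   rawlen;                 /* original data length */
    uLong   storedlen;              /* bytes actually stored on disk */
    Bytef  *data;                   /* compressed or raw payload */
} StoredTuple;

/* Compress the payload only if it is big enough and actually shrinks. */
static StoredTuple
store_tuple_data(const Bytef *raw, uLong rawlen)
{
    StoredTuple st;

    st.is_compressed = 0;
    st.rawlen = rawlen;

    if (rawlen > COMPRESS_THRESHOLD)
    {
        uLongf  zlen = compressBound(rawlen);
        Bytef  *zbuf = malloc(zlen);

        if (zbuf != NULL &&
            compress2(zbuf, &zlen, raw, rawlen, 9) == Z_OK &&
            zlen < rawlen)
        {
            st.is_compressed = 1;
            st.storedlen = zlen;
            st.data = zbuf;
            return st;
        }
        free(zbuf);                 /* too small a win, or failure */
    }

    /* small, or incompressible: store it verbatim */
    st.storedlen = rawlen;
    st.data = malloc(rawlen);
    memcpy(st.data, raw, rawlen);
    return st;
}

int
main(void)
{
    Bytef       raw[4096];
    StoredTuple st;

    memset(raw, 'x', sizeof(raw));  /* a highly compressible "tuple" */
    st = store_tuple_data(raw, sizeof(raw));
    printf("compressed=%d stored=%lu of %lu bytes\n",
           st.is_compressed,
           (unsigned long) st.storedlen,
           (unsigned long) st.rawlen);
    free(st.data);
    return 0;
}

The read side would just test the flag and decompress the data area before
handing it to the usual attribute-extraction code.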

Also, if we do it that way, then it would *automatically* apply to
both regular tuples and LO, because the current LO implementation is
just tuples.  (Tatsuo's idea of a non-transaction-controlled LO would
need extra work, of course, if we decide that's a good idea...)
        regards, tom lane

