Thread: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
"Stephen R. van den Berg"
Date:
I asked the author of the QuickLZ algorithm about licensing... Sounds like he is willing to cooperate. This is what I got from him:

On Sat, Jan 3, 2009 at 17:56, Lasse Reinhold <lar@quicklz.com> wrote:
> Hi Stephen,
>
> That sounds really exciting, I'd love to see QuickLZ included into
> PostgreSQL. I'd be glad to offer support and add custom optimizations,
> features or hacks or whatever should turn up.
>
> My only concern is to avoid undermining the commercial license, but this
> can, as you suggest, be solved by exceptionally allowing QuickLZ to be
> linked with PostgreSQL. Since I have exclusive copyright of QuickLZ any
> construction is possible.
>
> Greetings,
>
> Lasse Reinhold
> Developer
> http://www.quicklz.com/
> lar@quicklz.com
>
> On Sat Jan 3 15:46 , 'Stephen R. van den Berg' sent:
>> PostgreSQL is the most advanced Open Source database at this moment; it is
>> being distributed under a Berkeley license, though.
>>
>> What if we'd like to use your QuickLZ algorithm in the PostgreSQL core
>> to compress rows in the internal archive format (it's not going to be a
>> compression algorithm which is exposed to the SQL level)?
>> Is it conceivable that you'd allow us to use the algorithm free of charge
>> and allow it to be distributed under the Berkeley license, as long as it
>> is part of the PostgreSQL backend?
>> --
>> Sincerely, Stephen R. van den Berg.
>>
>> Expect the unexpected!

--
Sincerely, Stephen R. van den Berg.
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Alvaro Herrera
Date:
> On Sat, Jan 3, 2009 at 17:56, Lasse Reinhold <lar@quicklz.com> wrote:
>> That sounds really exciting, I'd love to see QuickLZ included into
>> PostgreSQL. I'd be glad to offer support and add custom optimizations,
>> features or hacks or whatever should turn up.
>>
>> My only concern is to avoid undermining the commercial license, but this
>> can, as you suggest, be solved by exceptionally allowing QuickLZ to be
>> linked with PostgreSQL. Since I have exclusive copyright of QuickLZ any
>> construction is possible.

Hmm ... keep in mind that PostgreSQL is used as a base for a certain number of commercial, non-BSD products (Greenplum, Netezza, EnterpriseDB, Truviso are the ones that come to mind). Would this exception allow for linking QuickLZ with them too? It doesn't sound to me like you're open to relicensing it under BSD, which puts us in an uncomfortable position.

--
Alvaro Herrera    http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
"Stephen R. van den Berg"
Date:
Alvaro Herrera wrote:
>> On Sat, Jan 3, 2009 at 17:56, Lasse Reinhold <lar@quicklz.com> wrote:
>>> That sounds really exciting, I'd love to see QuickLZ included into
>>> PostgreSQL. I'd be glad to offer support and add custom optimizations,
>>> features or hacks or whatever should turn up.
>>> My only concern is to avoid undermining the commercial license, but this
>>> can, as you suggest, be solved by exceptionally allowing QuickLZ to be
>>> linked with PostgreSQL. Since I have exclusive copyright of QuickLZ any
>>> construction is possible.
>
> Hmm ... keep in mind that PostgreSQL is used as a base for a certain
> number of commercial, non-BSD products (Greenplum, Netezza,
> EnterpriseDB, Truviso, are the ones that come to mind). Would this
> exception allow for linking QuickLZ with them too? It doesn't sound to
> me like you're open to relicensing it under BSD, which puts us in an
> uncomfortable position.

I'm not speaking for Lasse, merely providing food for thought, but it sounds feasible to me (and conforming to the spirit of Lasse's intended license) to put something like the following license on his code, which would allow inclusion into the PostgreSQL codebase and not restrict usage in any of the derived works:

"Grant license to use the code in question without cost, provided that the code is being linked to at least 50% of the PostgreSQL code it is being distributed alongside with."

This should allow commercial reuse in derived products without undesirable side effects.

--
Sincerely, Stephen R. van den Berg.
"Well, if we're going to make a party of it, let's nibble Nobby's nuts!"
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
"Douglas McNaught"
Date:
On Mon, Jan 5, 2009 at 3:18 AM, Stephen R. van den Berg <srb@cuci.nl> wrote:
> I'm not speaking for Lasse, merely providing food for thought, but it sounds
> feasible to me (and conforming to Lasse's spirit of his intended license)
> to put something like the following license on his code, which would allow
> inclusion into the PostgreSQL codebase and not restrict usage in any
> of the derived works:
>
> "Grant license to use the code in question without cost, provided that
> the code is being linked to at least 50% of the PostgreSQL code it is
> being distributed alongside with."
>
> This should allow commercial reuse in derived products without undesirable
> sideeffects.

I think Postgres becomes non-DFSG-free if this is done. All of a sudden one can't pull arbitrary pieces of code out of PG and use them in other projects (which I'd argue is the intent if not the letter of the DFSG). Have we ever allowed code in on these terms before? Are we willing to be dropped from Debian and possibly Red Hat if this is the case?

-Doug
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Andrew Dunstan
Date:
Douglas McNaught wrote:
> On Mon, Jan 5, 2009 at 3:18 AM, Stephen R. van den Berg <srb@cuci.nl> wrote:
>> I'm not speaking for Lasse, merely providing food for thought, but it sounds
>> feasible to me (and conforming to Lasse's spirit of his intended license)
>> to put something like the following license on his code, which would allow
>> inclusion into the PostgreSQL codebase and not restrict usage in any
>> of the derived works:
>>
>> "Grant license to use the code in question without cost, provided that
>> the code is being linked to at least 50% of the PostgreSQL code it is
>> being distributed alongside with."
>>
>> This should allow commercial reuse in derived products without undesirable
>> sideeffects.
>
> I think Postgres becomes non-DFSG-free if this is done. All of a
> sudden one can't pull arbitrary pieces of code out of PG and use them
> in other projects (which I'd argue is the intent if not the letter of
> the DFSG). Have we ever allowed code in on these terms before? Are
> we willing to be dropped from Debian and possibly Red Hat if this is
> the case?

Presumably a clean-room implementation of this algorithm would get us over these hurdles, if anyone wants to undertake it.

I certainly agree that we don't want arbitrary bits of our code to be encumbered or licensed differently from the rest.

cheers

andrew
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Stefan Kaltenbrunner
Date:
Andrew Dunstan wrote:
> Douglas McNaught wrote:
>> On Mon, Jan 5, 2009 at 3:18 AM, Stephen R. van den Berg <srb@cuci.nl> wrote:
>>> I'm not speaking for Lasse, merely providing food for thought, but it
>>> sounds feasible to me (and conforming to Lasse's spirit of his intended
>>> license) to put something like the following license on his code, which
>>> would allow inclusion into the PostgreSQL codebase and not restrict usage
>>> in any of the derived works:
>>>
>>> "Grant license to use the code in question without cost, provided that
>>> the code is being linked to at least 50% of the PostgreSQL code it is
>>> being distributed alongside with."
>>>
>>> This should allow commercial reuse in derived products without undesirable
>>> sideeffects.
>>
>> I think Postgres becomes non-DFSG-free if this is done. All of a
>> sudden one can't pull arbitrary pieces of code out of PG and use them
>> in other projects (which I'd argue is the intent if not the letter of
>> the DFSG). Have we ever allowed code in on these terms before? Are
>> we willing to be dropped from Debian and possibly Red Hat if this is
>> the case?
>
> Presumably a clean room implementation of this algorithm would get us
> over these hurdles, if anyone wants to undertake it.
>
> I certainly agree that we don't want arbitrary bits of our code to be
> encumbered or licensed differently from the rest.

Do we actually have any numbers showing that QuickLZ is actually faster and/or compresses better than what we have now?

Stefan
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
"Robert Haas"
Date:
> Are we willing to be dropped from Debian and possibly Red Hat if this
> is the case?

No. I frankly think this discussion is a dead end.

The whole thing got started because Alex Hunsaker pointed out that his database got a lot bigger because we disabled compression on columns > 1MB. It seems like the obvious thing to do is turn it back on again. The only objection to that is that it will hurt performance, especially on substring operations.

That led to a discussion of alternative compression algorithms, which is only relevant if we believe that there are people out there who want to do substring extractions on huge columns AND want those columns to be compressed. At least on this thread, we have zero requests for that feature combination.

What we do have is a suggestion from several people that the database shouldn't be in the business of compressing data AT ALL. If we want to implement that suggestion, then we could change the default column storage type.

Regardless of whether we do that or not, no one has offered any justification for the arbitrary decision not to compress columns >1MB, and at least one person (Peter) has suggested that it is exactly backwards. I think he's right, and this part should be backed out. That will leave us back in the sensible place where people who want compression can get it (which is currently false) and people who don't want it can get rid of it (which has always been true). If there is a demand for alternative compression algorithms, then someone can submit a patch for that for 8.5.

...Robert
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
"Stephen R. van den Berg"
Date:
Douglas McNaught wrote:
>> "Grant license to use the code in question without cost, provided that
>> the code is being linked to at least 50% of the PostgreSQL code it is
>> being distributed alongside with."
>>
>> This should allow commercial reuse in derived products without undesirable
>> sideeffects.
>
> I think Postgres becomes non-DFSG-free if this is done. All of a
> sudden one can't pull arbitrary pieces of code out of PG and use them
> in other projects (which I'd argue is the intent if not the letter of
> the DFSG). Have we ever allowed code in on these terms before? Are
> we willing to be dropped from Debian and possibly Red Hat if this is
> the case?

Upon reading the DFSG, it seems you have a point... However...
QuickLZ is dual licensed:
a. Royalty-free perpetual use as part of the PostgreSQL backend or any
   derived works of PostgreSQL which link in *at least* 50% of the
   original PostgreSQL codebase.
b. GPL if (a) does not apply for some reason.

I.e. for all intents and purposes, it fits the bill for both:
1. PostgreSQL-derived products (existing and future).
2. Debian/Red Hat, since the source can be used under the GPL.

In essence, it would be kind of a GPL license on steroids; it grants Berkeley-style rights as long as the source is part of PostgreSQL (or a derived work thereof), and it falls back to GPL if extracted.

--
Sincerely, Stephen R. van den Berg.
"Well, if we're going to make a party of it, let's nibble Nobby's nuts!"
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Tom Lane
Date:
"Douglas McNaught" <doug@mcnaught.org> writes:
> I think Postgres becomes non-DFSG-free if this is done. All of a
> sudden one can't pull arbitrary pieces of code out of PG and use them
> in other projects (which I'd argue is the intent if not the letter of
> the DFSG). Have we ever allowed code in on these terms before?

No, and we aren't starting now. Any submission that's not under a BSD-equivalent license will be rejected. Count on it.

regards, tom lane
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Andrew Chernow
Date:
Robert Haas wrote:
> What we do have is a suggestion from several people that the database
> shouldn't be in the business of compressing data AT ALL. If we want

+1

IMHO, this is a job for the application. I also think the current implementation is a little odd in that it only compresses data objects under a meg.

--
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
"A.M."
Date:
On Jan 5, 2009, at 1:16 PM, Stephen R. van den Berg wrote:
> Upon reading the DFSG, it seems you have a point...
> However...
> QuickLZ is dual licensed:
> a. Royalty-free perpetual use as part of the PostgreSQL backend or
>    any derived works of PostgreSQL which link in *at least* 50% of the
>    original PostgreSQL codebase.

How does one even define "50% of the original PostgreSQL codebase"? What nonsense.

-M
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Gregory Stark
Date:
"Robert Haas" <robertmhaas@gmail.com> writes:
> Regardless of whether we do that or not, no one has offered any
> justification of the arbitrary decision not to compress columns >1MB,

Er, yes, there was discussion before the change, for instance:

http://archives.postgresql.org/pgsql-hackers/2007-08/msg00082.php

And do you have any response to this point?

I think the right value for this setting is going to depend on the environment. If the system is starved for cpu cycles then you won't want to compress large data. If it's starved for i/o bandwidth but has spare cpu cycles then you will.

http://archives.postgresql.org/pgsql-hackers/2009-01/msg00074.php

> and at least one person (Peter) has suggested that it is exactly
> backwards. I think he's right, and this part should be backed out.

Well, the original code had a threshold above which we *always* compressed, even if it saved only a single byte.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Tom Lane
Date:
"Robert Haas" <robertmhaas@gmail.com> writes:
> The whole thing got started because Alex Hunsaker pointed out that his
> database got a lot bigger because we disabled compression on columns >
> 1MB. It seems like the obvious thing to do is turn it back on again.

I suggest that before we make any knee-jerk responses, we need to go back and reread the prior discussion. The current 8.4 code was proposed here:

http://archives.postgresql.org/pgsql-patches/2008-02/msg00053.php

and that message links to several older threads that were complaining about the 8.3 behavior. In particular the notion of an upper limit on what we should attempt to compress was discussed in this thread:

http://archives.postgresql.org/pgsql-general/2007-08/msg01129.php

After poking around in those threads a bit, I think that the current threshold of 1MB was something I just made up on the fly (I did note that it needed tuning...). Perhaps something like 10MB would be a better default. Another possibility is to have different minimum compression rates for "small" and "large" datums.

regards, tom lane
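[Editor's note: the two knobs Tom mentions — a size threshold and per-size minimum compression rates — can be sketched as a simple decision function. This is an illustrative sketch only; the names and the particular rates are hypothetical, not the actual pglz tuning parameters.]

```python
def keep_compressed(raw_len: int, compressed_len: int,
                    large_cutoff: int = 10 * 1024 * 1024,  # the suggested 10MB default
                    small_min_saving: float = 0.20,        # hypothetical rates
                    large_min_saving: float = 0.25) -> bool:
    """Decide whether the compressed copy of a datum is worth storing.

    Small and large datums get different minimum compression rates, as
    suggested above; a datum that doesn't shrink enough would be stored
    uncompressed instead.
    """
    if raw_len == 0:
        return False
    min_saving = small_min_saving if raw_len < large_cutoff else large_min_saving
    saving = 1.0 - compressed_len / raw_len
    return saving >= min_saving
```

Under this policy a 1kB datum compressing to 700 bytes (30% saving) is kept compressed, while a 20MB datum that only shrinks by 5% is not.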
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
"Robert Haas"
Date:
On Mon, Jan 5, 2009 at 2:02 PM, Gregory Stark <stark@enterprisedb.com> wrote:
>> Regardless of whether we do that or not, no one has offered any
>> justification of the arbitrary decision not to compress columns >1MB,
>
> Er, yes, there was discussion before the change, for instance:
>
> http://archives.postgresql.org/pgsql-hackers/2007-08/msg00082.php

OK, maybe I'm missing something, but I don't see anywhere in that email where it suggests NEVER compressing anything above 1MB. It suggests some more nuanced things which are quite different.

> And do you have any response to this point?
>
> I think the right value for this setting is going to depend on the
> environment. If the system is starved for cpu cycles then you won't want to
> compress large data. If it's starved for i/o bandwidth but has spare cpu
> cycles then you will.
>
> http://archives.postgresql.org/pgsql-hackers/2009-01/msg00074.php

I think it is a good point, to the extent that compression is an option that people choose in order to improve performance. I'm not really convinced that this is the case, but I haven't seen much evidence on either side of the question.

> Well the original code had a threshold above which we *always* compressed
> even if it saved only a single byte.

I certainly don't think that's right either.

...Robert
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
"Holger Hoffstaette"
Date:
On Mon, 05 Jan 2009 13:44:57 -0500, Andrew Chernow wrote:
> Robert Haas wrote:
>> What we do have is a suggestion from several people that the database
>> shouldn't be in the business of compressing data AT ALL. If we want

DB2 users generally seem very happy with the built-in compression.

> IMHO, this is a job for the application.

Changing applications is several times more expensive and often simply not possible.

-h
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
"Joshua D. Drake"
Date:
On Mon, 2009-01-05 at 13:04 -0500, Robert Haas wrote:
>> Are we willing to be dropped from Debian and possibly Red Hat if this
>> is the case?
>
> Regardless of whether we do that or not, no one has offered any
> justification of the arbitrary decision not to compress columns >1MB,
> and at least one person (Peter) has suggested that it is exactly
> backwards. I think he's right, and this part should be backed out.
> That will leave us back in the sensible place where people who want
> compression can get it (which is currently false) and people who don't
> want it can get rid of it (which has always been true). If there is a
> demand for alternative compression algorithms, then someone can submit
> a patch for that for 8.5.

+1

Sincerely,
Joshua D. Drake

--
PostgreSQL Consulting, Development, Support, Training
503-667-4564 - http://www.commandprompt.com/
The PostgreSQL Company, serving since 1997
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Mark Mielke
Date:
Guaranteed compression of large data fields is the responsibility of the client. The database should feel free to compress behind the scenes if it turns out to be desirable, but an expectation that it compresses is wrong in my opinion.

That said, I'm wondering why compression has to be a problem, or why >1 Mbyte is a reasonable compromise? I missed the original thread that led to 8.4. It seems to me that transparent file system compression doesn't have limits like "files must be less than 1 Mbyte to be compressed". They don't exhibit poor file system performance. I remember back in the 386/486 days that I would always DriveSpace compress everything, because hard disks were so slow then that DriveSpace would actually increase performance.

The toast tables already give a sort of block-addressable scheme. Compression can be on a per-block or per-set-of-blocks basis, allowing for seeking into the block; or if compression doesn't seem to be working for the first few blocks, the later blocks can be stored uncompressed? Or is that too complicated compared to what we have now? :-)

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
"Stephen R. van den Berg"
Date:
A.M. wrote:
> On Jan 5, 2009, at 1:16 PM, Stephen R. van den Berg wrote:
>> Upon reading the DFSG, it seems you have a point...
>> However...
>> QuickLZ is dual licensed:
>> a. Royalty-free perpetual use as part of the PostgreSQL backend or
>>    any derived works of PostgreSQL which link in *at least* 50% of the
>>    original PostgreSQL codebase.
>
> How does one even define "50% of the original PostgreSQL codebase"?
> What nonsense.

It's a suggested (but by no means definitive) technical translation of the legalese term "substantial". Substitute something better, by all means.

--
Sincerely, Stephen R. van den Berg.
"Well, if we're going to make a party of it, let's nibble Nobby's nuts!"
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Gregory Stark
Date:
Mark Mielke <mark@mark.mielke.cc> writes:
> It seems to me that transparent file system compression doesn't have limits
> like "files must be less than 1 Mbyte to be compressed". They don't exhibit
> poor file system performance.

Well, I imagine those implementations are more complex than toast is. I'm not sure what lessons we can learn from their behaviour directly.

> I remember back in the 386/486 days, that I would always DriveSpace compress
> everything, because hard disks were so slow then that DriveSpace would
> actually increase performance.

Surely this depends on whether your machine was cpu starved or disk starved? Do you happen to recall which camp these anecdotal machines fell in?

> The toast tables already give a sort of block-addressable scheme.
> Compression can be on a per block or per set of blocks basis allowing for
> seek into the block,

The current toast architecture is that we compress the whole datum, then store the datum either inline or using the same external blocking mechanism that we use when not compressing. So this doesn't fit at all.

It does seem like an interesting idea to have toast chunks which are compressed individually. So each chunk could be, say, an 8kb chunk of plaintext and stored as whatever size it ends up being after compression. That would allow us to do random access into external chunks as well as allow overlaying the cpu costs of decompression with the i/o costs. It would get a lower compression ratio than compressing the whole object together, but we would have to experiment to see how big a problem that was.

It would be pretty much rewriting the toast mechanism for external compressed data though. Currently the storage and the compression are handled separately. This would tie the two together in a separate code path.

Hm, it occurs to me we could almost use the existing code. Just store it as a regular uncompressed external datum but allow the toaster to operate on the data column (which it's normally not allowed to) to compress it, but not store it externally.

> or if compression doesn't seem to be working for the first few blocks, the
> later blocks can be stored uncompressed? Or is that too complicated compared
> to what we have now? :-)

Actually we do that now; it was part of the same patch we're discussing.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!
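[Editor's note: the per-chunk idea — compress each fixed-size plaintext chunk independently so a substring fetch only inflates the chunks it touches — can be sketched as follows. This is an illustrative Python model using zlib, not the actual TOAST code; the function names are invented.]

```python
import zlib

CHUNK = 8192  # plaintext chunk size, per the 8kb suggestion above

def compress_chunks(datum: bytes) -> list[bytes]:
    # Each chunk is deflated on its own, so chunk N can be decompressed
    # without first decompressing chunks 0..N-1 (unlike whole-datum
    # compression, where the stream must be read from the start).
    return [zlib.compress(datum[i:i + CHUNK])
            for i in range(0, len(datum), CHUNK)]

def substring(chunks: list[bytes], start: int, length: int) -> bytes:
    # Random access: decompress only the chunks overlapping the slice.
    first = start // CHUNK
    last = (start + length - 1) // CHUNK
    buf = b"".join(zlib.decompress(c) for c in chunks[first:last + 1])
    offset = start - first * CHUNK
    return buf[offset:offset + length]
```

The trade-off noted above shows up directly: each chunk carries its own compression state, so the overall ratio is worse than compressing the datum in one piece.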
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
"Stephen R. van den Berg"
Date:
Tom Lane wrote:
> "Robert Haas" <robertmhaas@gmail.com> writes:
>> The whole thing got started because Alex Hunsaker pointed out that his
>> database got a lot bigger because we disabled compression on columns >
>> 1MB. It seems like the obvious thing to do is turn it back on again.
>
> After poking around in those threads a bit, I think that the current
> threshold of 1MB was something I just made up on the fly (I did note
> that it needed tuning...). Perhaps something like 10MB would be a
> better default. Another possibility is to have different minimum
> compression rates for "small" and "large" datums.

As far as I can imagine, the following use cases apply:
a. Columnsize <= 2048 bytes without substring access.
b. Columnsize <= 2048 bytes with substring access.
c. Columnsize > 2048 bytes, compressible, without substring access (text).
d. Columnsize > 2048 bytes, uncompressible, with substring access (multimedia).

Can anyone think of another use case I missed here?

To cover those cases, the following solutions seem feasible:
Sa. Disable compression for this column (manually, by the DBA).
Sb. Check if the compression saves more than 20%, store uncompressed otherwise.
Sc. Check if the compression saves more than 20%, store uncompressed otherwise.
Sd. Check if the compression saves more than 20%, store uncompressed otherwise.

For Sb, Sc and Sd we should probably only check the first 256KB or so to determine the expected savings.

--
Sincerely, Stephen R. van den Berg.
"Well, if we're going to make a party of it, let's nibble Nobby's nuts!"
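[Editor's note: the "check the first 256KB" heuristic from Sb-Sd can be sketched like this. An illustrative stand-in using zlib, with the 20% figure taken from the message above; the names are invented, not PostgreSQL internals.]

```python
import zlib

SAMPLE_LIMIT = 256 * 1024  # probe only the first 256KB, per the suggestion
MIN_SAVING = 0.20          # store uncompressed unless we save more than 20%

def worth_compressing(datum: bytes) -> bool:
    """Estimate compressibility from a prefix instead of the whole datum.

    This bounds the CPU cost of the probe even for very large values,
    at the risk of mispredicting data whose compressibility changes
    after the sampled prefix.
    """
    sample = datum[:SAMPLE_LIMIT]
    if not sample:
        return False
    saving = 1.0 - len(zlib.compress(sample)) / len(sample)
    return saving > MIN_SAVING
```

For case (d) above (already-compressed multimedia), the probe would come back well under 20% and the datum would be stored as-is, which is the behaviour Sd asks for.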
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> Hm, it occurs to me we could almost use the existing code. Just store it as a
> regular uncompressed external datum but allow the toaster to operate on the
> data column (which it's normally not allowed to) to compress it, but not
> store it externally.

Yeah, it would be very easy to do that, but the issue then would be that instead of having a lot of toast-chunk rows that are all carefully made to fit exactly 4 to a page, you have a lot of toast-chunk rows of varying size, and you are certainly going to waste some disk space due to not being able to pack pages full. In the worst case you'd end up with zero benefit from compression anyway.

As an example, if all of your 2K chunks compress by just under 20%, you get no savings because you can't quite fit 5 to a page. You'd need an average compression rate of more than 20% to get savings. We could improve that figure by making the chunk size smaller, but that carries its own performance penalties (more seeks to fetch all of a toasted value). Also, the smaller the chunks, the worse the compression will get.

It's an interesting idea, and would be easy to try, so I hope someone does test it out and see what happens. But I'm not expecting miracles.

I think a more realistic approach would be the one somebody suggested upthread: split large values into, say, 1MB segments that are compressed separately and then stored to TOAST separately. Substring fetches then pay the overhead of decompressing 1MB segments that they might need only part of, but at least they're not pulling out the whole gosh-darn value. As long as the segment size isn't tiny, the added storage inefficiency should be pretty minimal.

(How we'd ever do upgrade-in-place to any new compression scheme is an interesting question too...)

regards, tom lane
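[Editor's note: the packing arithmetic above can be checked directly. A toast chunk is sized so 4 fit on an 8K page; a per-chunk saving just under 20% still leaves only 4 compressed chunks per page, so nothing is gained. This sketch ignores page and tuple header overhead, which only makes the break-even point worse.]

```python
PAGE = 8192        # PostgreSQL page size
CHUNK = PAGE // 4  # 2K toast chunks, exactly 4 per page uncompressed

def chunks_per_page(chunk_size: int) -> int:
    # How many equal-sized chunk rows fit on one page
    # (header overhead ignored for simplicity).
    return PAGE // chunk_size

# ~19% saving: 2048 -> 1659 bytes, still only 4 per page; zero benefit.
assert chunks_per_page(1659) == 4
# A hair over 20% saving: 2048 -> 1638 bytes, finally 5 per page.
assert chunks_per_page(1638) == 5
```

The same function makes the smaller-chunk trade-off visible: halving the chunk size lowers the break-even saving, but doubles the number of rows (and seeks) per toasted value.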
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Mark Mielke
Date:
Gregory Stark wrote:
> Mark Mielke <mark@mark.mielke.cc> writes:
>> It seems to me that transparent file system compression doesn't have limits
>> like "files must be less than 1 Mbyte to be compressed". They don't exhibit
>> poor file system performance.
>
> Well I imagine those implementations are more complex than toast is. I'm not
> sure what lessons we can learn from their behaviour directly.
>
>> I remember back in the 386/486 days, that I would always DriveSpace compress
>> everything, because hard disks were so slow then that DriveSpace would
>> actually increase performance.
>
> Surely this depends on whether your machine was cpu starved or disk starved?
> Do you happen to recall which camp these anecdotal machines fell in?

I agree. I'm sure it was disk I/O starved - and maybe not just the disk. The motherboard might have contributed. :-)

My production machine in 2008/2009 for my uses still seems I/O bound. The main database server I use is 2 x Intel Xeon 3.0 GHz (dual-core) = 4 cores, and the uptime load average for the whole system is currently 0.10. The database and web server use their own 4 drives with RAID 10 (main system is on two other drives). Yes, I could always upgrade to a fancy/larger RAID array, SAS, 15k RPM drives, etc., but if a PostgreSQL tweak were to give me 30% more performance at a 15% CPU cost... I think that would be a great alternative option. :-)

Memory may also play a part. My server at home has 4 Mbytes of L2 cache and 4 Gbytes of RAM running with 5-5-5-18 DDR2 at 1000 MHz. At these speeds, my realized bandwidth for RAM is 6.0+ Gbyte/s. My L1/L2 operate at 10.0+ Gbyte/s. Compression doesn't run that fast, so at least for me, the benefit of having something in L1/L2 cache vs RAM isn't great; however, my disks in the RAID 10 configuration only read/write at ~150 Mbyte/s sustained, and much less if seeking is required. Compressing the data means 30% more data may fit into RAM, or a 30% increase in data read from disk, as I assume many compression algorithms can beat 150 Mbyte/s.

Is my configuration typical? It's probably becoming more so. Certainly more common than the 10+ disk hardware RAID configurations.

> The current toast architecture is that we compress the whole datum, then
> store the datum either inline or using the same external blocking mechanism
> that we use when not compressing. So this doesn't fit at all.
>
> It does seem like an interesting idea to have toast chunks which are
> compressed individually. So each chunk could be, say, an 8kb chunk of
> plaintext and stored as whatever size it ends up being after compression.
> That would allow us to do random access into external chunks as well as
> allow overlaying the cpu costs of decompression with the i/o costs. It
> would get a lower compression ratio than compressing the whole object
> together but we would have to experiment to see how big a problem that was.
>
> It would be pretty much rewriting the toast mechanism for external
> compressed data though. Currently the storage and the compression are
> handled separately. This would tie the two together in a separate code path.
>
> Hm, it occurs to me we could almost use the existing code. Just store it as
> a regular uncompressed external datum but allow the toaster to operate on
> the data column (which it's normally not allowed to) to compress it, but
> not store it externally.

Yeah - sounds like it could be messy.

>> or if compression doesn't seem to be working for the first few blocks, the
>> later blocks can be stored uncompressed? Or is that too complicated
>> compared to what we have now? :-)
>
> Actually we do that now, it was part of the same patch we're discussing.

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Andrew Chernow
Date:
Holger Hoffstaette wrote: > On Mon, 05 Jan 2009 13:44:57 -0500, Andrew Chernow wrote: > >> Robert Haas wrote: >>> What we do have is a suggestion from several people that the database >>> shouldn't be in the business of compressing data AT ALL. If we want > > DB2 users generally seem very happy with the built-in compression. > >> IMHO, this is a job for the application. > > Changing applications is several times more expensive and often simply not > possible. > > The database can still handle all of the compression requirements if the "application" creates a couple of user-defined functions (probably in C) that utilize one of the many existing compression libraries (hand picked for their needs). You can use them in triggers to make it transparent. You can use them directly in statements. You can control selecting the data compressed or uncompressed, which is a valid use case for remote clients that have to download a large bytea or text. You can toggle compression algorithms and settings dependent on $whatever. You can do all of this right now w/o the built-in compression, which is my point: why have the built-in compression at all? Seems like a home-cut solution provides more features and control with minimal engineering. All the real engineering is done: the database and compression libraries. All that's left are a few glue functions in C. Well, my two pennies :) -- Andrew Chernow eSilo, LLC every bit counts http://www.esilo.com/
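[Editor's note: the thread proposes C user-defined functions plus triggers; the Python sketch below, with zlib and a hypothetical marker prefix, only illustrates the transparent compress-on-write / decompress-on-read wrapper being described.]

```python
import zlib

MAGIC = b"\x01CZ"  # hypothetical marker; a real UDF would use a proper header

def store_value(raw: bytes, min_saving: float = 0.10) -> bytes:
    """Compress-on-write: keep the raw bytes unless compression saves enough.
    (Simplification: raw data that happened to start with MAGIC would need escaping.)"""
    packed = MAGIC + zlib.compress(raw)
    return packed if len(packed) <= len(raw) * (1 - min_saving) else raw

def fetch_value(stored: bytes) -> bytes:
    """Decompress-on-read counterpart; passes uncompressed values through."""
    return zlib.decompress(stored[len(MAGIC):]) if stored.startswith(MAGIC) else stored
```

Wired into an INSERT/UPDATE trigger and a retrieval function, this gives per-column, per-algorithm control without touching the backend, which is the point being argued.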
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
"Robert Haas"
Date:
> I suggest that before we make any knee-jerk responses, we need to go > back and reread the prior discussion. > http://archives.postgresql.org/pgsql-patches/2008-02/msg00053.php > and that message links to several older threads that were complaining > about the 8.3 behavior. In particular the notion of an upper limit > on what we should attempt to compress was discussed in this thread: > http://archives.postgresql.org/pgsql-general/2007-08/msg01129.php Thanks for the pointers. > After poking around in those threads a bit, I think that the current > threshold of 1MB was something I just made up on the fly (I did note > that it needed tuning...). Perhaps something like 10MB would be a > better default. Another possibility is to have different minimum > compression rates for "small" and "large" datums. After reading these discussions, I guess I still don't understand why we would treat small and large datums differently. It seems to me that you had it about right here: http://archives.postgresql.org/pgsql-hackers/2007-08/msg00082.php # Or maybe it should just be a min_comp_rate and nothing else. # Compressing a 1GB field to 999MB is probably not very sane either. I agree with that. force_input_size doesn't seem like a good idea because compression can be useless on big datums just as it can be on little ones - the obvious case being media file formats that are already internally compressed. Even if you can squeeze a little more out, you're using a lot of CPU time for a very small gain in storage and/or I/O. Furthermore, on a large object, saving even 1MB is not very significant if the datum is 1GB in size - so, again, a percentage seems like the right thing. On the other hand, even after reading these threads, I still don't see any need to disable compression for large datums. I can't think of any reason why I would want to try compressing a 900kB object but not 1MB one. 
It makes sense to me to not compress if the object doesn't compress well, or if some initial segment of the object doesn't compress well (say, if we can't squeeze 10% out of the first 64kB), but size by itself doesn't seem significant. To put that another way, if small objects and large objects are to be treated differently, which one will we try harder to compress and why? Greg Stark makes an argument that we should try harder when it might avoid the need for a toast table: http://archives.postgresql.org/pgsql-hackers/2007-08/msg00087.php ...which has some merit, though clearly it would be a lot better if we could do it when, and only when, it was actually going to work. Also, not compressing very small datums (< 256 bytes) also seems smart, since that could end up producing a lot of extra compression attempts, most of which will end up saving little or no space. Apart from those two cases I don't see any clear motivation for discriminating on size. ...Robert
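[Editor's note: the policy being proposed — a percentage test plus a small-datum floor, with no upper size cap — can be stated compactly. In this hedged Python sketch, zlib stands in for pglz, and the 10%, 64 kB, and 256-byte numbers are the ones floated in the thread, not settled values.]

```python
import zlib

SAMPLE_BYTES = 64 * 1024   # probe only a prefix of large datums
MIN_SAVING = 0.10          # require at least 10% savings on the probe
MIN_DATUM = 256            # don't even attempt compression below this

def should_compress(datum: bytes) -> bool:
    """Size-agnostic policy: no upper limit; decide from a sampled ratio."""
    if len(datum) < MIN_DATUM:
        return False
    sample = datum[:SAMPLE_BYTES]
    return len(zlib.compress(sample)) <= len(sample) * (1 - MIN_SAVING)
```

Note how this naturally rejects already-compressed media files of any size (their prefix won't probe well) while still accepting a 1 GB text column, which is the behavior argued for above.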
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Bruce Momjian
Date:
Robert Haas wrote: > > After poking around in those threads a bit, I think that the current > > threshold of 1MB was something I just made up on the fly (I did note > > that it needed tuning...). Perhaps something like 10MB would be a > > better default. Another possibility is to have different minimum > > compression rates for "small" and "large" datums. > > After reading these discussions, I guess I still don't understand why > we would treat small and large datums differently. It seems to me > that you had it about right here: > > http://archives.postgresql.org/pgsql-hackers/2007-08/msg00082.php > > # Or maybe it should just be a min_comp_rate and nothing else. > # Compressing a 1GB field to 999MB is probably not very sane either. > > I agree with that. force_input_size doesn't seem like a good idea > because compression can be useless on big datums just as it can be on > little ones - the obvious case being media file formats that are > already internally compressed. Even if you can squeeze a little more > out, you're using a lot of CPU time for a very small gain in storage > and/or I/O. Furthermore, on a large object, saving even 1MB is not > very significant if the datum is 1GB in size - so, again, a percentage > seems like the right thing. > > On the other hand, even after reading these threads, I still don't see > any need to disable compression for large datums. I can't think of > any reason why I would want to try compressing a 900kB object but not > 1MB one. It makes sense to me to not compress if the object doesn't > compress well, or if some initial segment of the object doesn't > compress well (say, if we can't squeeze 10% out of the first 64kB), > but size by itself doesn't seem significant. > > To put that another way, if small objects and large objects are to be > treated differently, which one will we try harder to compress and why? 
> Greg Stark makes an argument that we should try harder when it might > avoid the need for a toast table: > > http://archives.postgresql.org/pgsql-hackers/2007-08/msg00087.php > > ...which has some merit, though clearly it would be a lot better if we > could do it when, and only when, it was actually going to work. Also, > not compressing very small datums (< 256 bytes) also seems smart, > since that could end up producing a lot of extra compression attempts, > most of which will end up saving little or no space. > > Apart from those two cases I don't see any clear motivation for > discriminating on size. Agreed. I have seen a lot of discussion on this topic and the majority seems to feel that a size limit on compression doesn't make sense in the general case. It is true that there is diminished performance for substring operations as TOAST values get longer, but compression does give better performance for longer values for full-field retrieval. I don't think we should be optimizing TOAST for substrings --- users who know they are going to be using substrings can specify the storage type for the column directly. Having any kind of maximum makes it hard for administrators to know exactly what is happening in TOAST. I think the upper limit should be removed, with a mention in the substring() documentation of the use of non-compressed TOAST storage. The only way I think an upper compression limit makes sense is if the backend can't uncompress the value to return it to the user, but then you have to wonder how the value got into the database in the first place. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Tom Lane
Date:
"Robert Haas" <robertmhaas@gmail.com> writes: > After reading these discussions, I guess I still don't understand why > we would treat small and large datums differently. It seems to me > that you had it about right here: > http://archives.postgresql.org/pgsql-hackers/2007-08/msg00082.php > # Or maybe it should just be a min_comp_rate and nothing else. > # Compressing a 1GB field to 999MB is probably not very sane either. Well, that's okay with me. I think that the other discussion was mainly focused on the silliness of compressing large datums when only a small percentage could be saved. What we might do for the moment is just to set the upper limit to INT_MAX in the default strategy, rather than rip out the logic altogether. IIRC that limit is checked only once per compression, not in the inner loop, so it won't cost us any noticeable performance to leave the logic there in case someone finds a use for it. > not compressing very small datums (< 256 bytes) also seems smart, > since that could end up producing a lot of extra compression attempts, > most of which will end up saving little or no space. But note that the current code will usually not try to do that anyway, at least for rows of ordinary numbers of columns. The present code has actually reduced the lower-bound threshold from where it used to be. I think that if anyone wants to argue for a different value, it'd be time to whip out some actual tests ;-). We can't set specific parameter values from gedanken-experiments. regards, tom lane
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Gregory Stark
Date:
> "Robert Haas" <robertmhaas@gmail.com> writes: > >> not compressing very small datums (< 256 bytes) also seems smart, >> since that could end up producing a lot of extra compression attempts, >> most of which will end up saving little or no space. That was presumably the rationale for the original logic. However experience shows that there are certainly databases that store a lot of compressible short strings. Obviously databases with CHAR(n) desperately need us to compress them. But even plain text data are often moderately compressible even with our fairly weak compression algorithm. One other thing that bothers me about our toast mechanism is that it only kicks in for tuples that are "too large". It seems weird that the same column is worth compressing or not depending on what other columns are in the same tuple. If you store a 2000 byte tuple that's all spaces we don't try to compress it at all. But if you added one more attribute we would go to great lengths compressing and storing attributes externally -- not necessarily the attribute you just added, the ones that were perfectly fine previously -- to try to get it under 2k. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
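[Editor's note: the 2000-bytes-of-spaces example is easy to quantify. In this sketch zlib stands in for the backend's weaker pglz compressor, so the exact byte counts are illustrative only.]

```python
import zlib

# A CHAR(2000)-style value: a little real text, then trailing pad spaces.
padded = b"abc" + b" " * 1997
packed = zlib.compress(padded)

# Run-dominated input collapses to a handful of bytes even with a
# general-purpose codec -- exactly the kind of tuple that today escapes
# compression entirely because it sits just under the toast threshold.
print(len(padded), "->", len(packed))
```

The asymmetry described above follows directly: whether this datum gets compressed currently depends not on its own (excellent) compressibility but on whether its siblings push the tuple over the threshold.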
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Lasse Reinhold
Date:
Stephen R. van den Berg wrote: > > I asked the author of the QuickLZ algorithm about licensing... > Sounds like he is willing to cooperate. This is what I got from him: > > On Sat, Jan 3, 2009 at 17:56, Lasse Reinhold <lar@quicklz.com> wrote: >> Hi Stephen, >> >> That sounds really exciting, I'd love to see QuickLZ included into >> PostgreSQL. I'd be glad to offer support and add custom optimizations, >> features or hacks or whatever should turn up. >> >> My only concern is to avoid undermining the commercial license, but this >> can, as you suggest, be solved by exceptionally allowing QuickLZ to be >> linked with PostgreSQL. Since I have exclusive copyright of QuickLZ any >> construction is possible. > Another solution could be to make PostgreSQL prepared for using compression with QuickLZ, letting the end user download QuickLZ separately and enable it with a compiler flag during compilation.
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
"Robert Haas"
Date:
>>> not compressing very small datums (< 256 bytes) also seems smart, >>> since that could end up producing a lot of extra compression attempts, >>> most of which will end up saving little or no space. > > That was presumably the rationale for the original logic. However experience > shows that there are certainly databases that store a lot of compressible > short strings. > > Obviously databases with CHAR(n) desperately need us to compress them. But > even plain text data are often moderately compressible even with our fairly > weak compression algorithm. > > One other thing that bothers me about our toast mechanism is that it only > kicks in for tuples that are "too large". It seems weird that the same column > is worth compressing or not depending on what other columns are in the same > tuple. That's a fair point. There's definitely some inconsistency in the current behavior. It seems to me that, in theory, compression and out-of-line storage are two separate behaviors. Out-of-line storage is pretty much a requirement for dealing with large objects, given that the page size is a constant; compression is not a requirement, but definitely beneficial under some circumstances, particularly when it removes the need for out-of-line storage. char(n) is kind of a weird case because you could also compress by storing a count of the trailing spaces, without applying a general-purpose compression algorithm. But either way the field is no longer fixed-width, and therefore field access can't be done as a simple byte offset from the start of the tuple. It's difficult even to enumerate the possible use cases, let alone what knobs would be needed to cater to all of them. ...Robert
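[Editor's note: the trailing-space scheme mentioned for char(n) needs no general-purpose codec at all. Function names below are illustrative; real storage would need a varlena-style header rather than a Python tuple.]

```python
def pack_char_n(value: str, n: int) -> tuple[str, int]:
    """Store the stripped text plus a count of stripped trailing pad spaces."""
    padded = value.ljust(n)              # char(n) semantics: space-padded to width n
    stripped = padded.rstrip(" ")
    return stripped, len(padded) - len(stripped)

def unpack_char_n(stripped: str, pad: int) -> str:
    """Reconstitute the full-width char(n) value."""
    return stripped + " " * pad
```

Either representation is variable-width, which is the point being made: once padding is stripped (or any compression applied), field access can no longer be a simple byte offset from the start of the tuple.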
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Alvaro Herrera
Date:
Robert Haas escribió: > char(n) is kind of a weird case because you could also compress by > storing a count of the trailing spaces, without applying a > general-purpose compression algorithm. But either way the field is no > longer fixed-width, and therefore field access can't be done as a > simple byte offset from the start of the tuple. That's not the case anyway (fixed byte width), due to possible multibyte chars. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.