Thread: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
"Stephen R. van den Berg"
Date:
I asked the author of the QuickLZ algorithm about licensing... Sounds like he is willing to cooperate. This is what I got from him:

On Sat, Jan 3, 2009 at 17:56, Lasse Reinhold <lar@quicklz.com> wrote:
> Hi Stephen,
>
> That sounds really exciting, I'd love to see QuickLZ included into
> PostgreSQL. I'd be glad to offer support and add custom optimizations,
> features or hacks or whatever should turn up.
>
> My only concern is to avoid undermining the commercial license, but this
> can, as you suggest, be solved by exceptionally allowing QuickLZ to be
> linked with PostgreSQL. Since I have exclusive copyright of QuickLZ any
> construction is possible.
>
> Greetings,
>
> Lasse Reinhold
> Developer
> http://www.quicklz.com/
> lar@quicklz.com
>
> On Sat Jan 3 15:46 , 'Stephen R. van den Berg' sent:
>> PostgreSQL is the most advanced Open Source database at this moment; it is
>> being distributed under a Berkeley license, though.
>>
>> What if we'd like to use your QuickLZ algorithm in the PostgreSQL core
>> to compress rows in the internal archive format (it's not going to be a
>> compression algorithm which is exposed to the SQL level)?
>> Is it conceivable that you'd allow us to use the algorithm free of charge
>> and allow it to be distributed under the Berkeley license, as long as it
>> is part of the PostgreSQL backend?
>> --
>> Sincerely, Stephen R. van den Berg.
>>
>> Expect the unexpected!

--
Sincerely, Stephen R. van den Berg.
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Alvaro Herrera
Date:
> On Sat, Jan 3, 2009 at 17:56, Lasse Reinhold <lar@quicklz.com> wrote:
>> That sounds really exciting, I'd love to see QuickLZ included into
>> PostgreSQL. I'd be glad to offer support and add custom optimizations,
>> features or hacks or whatever should turn up.
>>
>> My only concern is to avoid undermining the commercial license, but this
>> can, as you suggest, be solved by exceptionally allowing QuickLZ to be
>> linked with PostgreSQL. Since I have exclusive copyright of QuickLZ any
>> construction is possible.

Hmm ... keep in mind that PostgreSQL is used as a base for a certain number of commercial, non-BSD products (Greenplum, Netezza, EnterpriseDB, Truviso are the ones that come to mind). Would this exception allow for linking QuickLZ with them too? It doesn't sound to me like you're open to relicensing it under BSD, which puts us in an uncomfortable position.

--
Alvaro Herrera    http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
"Stephen R. van den Berg"
Date:
Alvaro Herrera wrote:
>> On Sat, Jan 3, 2009 at 17:56, Lasse Reinhold <lar@quicklz.com> wrote:
>>> That sounds really exciting, I'd love to see QuickLZ included into
>>> PostgreSQL. I'd be glad to offer support and add custom optimizations,
>>> features or hacks or whatever should turn up.
>>> My only concern is to avoid undermining the commercial license, but this
>>> can, as you suggest, be solved by exceptionally allowing QuickLZ to be
>>> linked with PostgreSQL. Since I have exclusive copyright of QuickLZ any
>>> construction is possible.
>
> Hmm ... keep in mind that PostgreSQL is used as a base for a certain
> number of commercial, non-BSD products (Greenplum, Netezza,
> EnterpriseDB, Truviso, are the ones that come to mind). Would this
> exception allow for linking QuickLZ with them too? It doesn't sound to
> me like you're open to relicensing it under BSD, which puts us in an
> uncomfortable position.

I'm not speaking for Lasse, merely providing food for thought, but it sounds feasible to me (and conforming to the spirit of Lasse's intended license) to put something like the following license on his code, which would allow inclusion into the PostgreSQL codebase and not restrict usage in any of the derived works:

"Grant license to use the code in question without cost, provided that the code is being linked to at least 50% of the PostgreSQL code it is being distributed alongside with."

This should allow commercial reuse in derived products without undesirable side effects.

--
Sincerely, Stephen R. van den Berg.
"Well, if we're going to make a party of it, let's nibble Nobby's nuts!"
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
"Douglas McNaught"
Date:
On Mon, Jan 5, 2009 at 3:18 AM, Stephen R. van den Berg <srb@cuci.nl> wrote:
> I'm not speaking for Lasse, merely providing food for thought, but it sounds
> feasible to me (and conforming to Lasse's spirit of his intended license)
> to put something like the following license on his code, which would allow
> inclusion into the PostgreSQL codebase and not restrict usage in any
> of the derived works:
>
> "Grant license to use the code in question without cost, provided that
> the code is being linked to at least 50% of the PostgreSQL code it is
> being distributed alongside with."
>
> This should allow commercial reuse in derived products without undesirable
> sideeffects.

I think Postgres becomes non-DFSG-free if this is done. All of a sudden one can't pull arbitrary pieces of code out of PG and use them in other projects (which I'd argue is the intent if not the letter of the DFSG). Have we ever allowed code in on these terms before? Are we willing to be dropped from Debian and possibly Red Hat if this is the case?

-Doug
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Andrew Dunstan
Date:
Douglas McNaught wrote:
> On Mon, Jan 5, 2009 at 3:18 AM, Stephen R. van den Berg <srb@cuci.nl> wrote:
>> I'm not speaking for Lasse, merely providing food for thought, but it sounds
>> feasible to me (and conforming to Lasse's spirit of his intended license)
>> to put something like the following license on his code, which would allow
>> inclusion into the PostgreSQL codebase and not restrict usage in any
>> of the derived works:
>>
>> "Grant license to use the code in question without cost, provided that
>> the code is being linked to at least 50% of the PostgreSQL code it is
>> being distributed alongside with."
>>
>> This should allow commercial reuse in derived products without undesirable
>> sideeffects.
>
> I think Postgres becomes non-DFSG-free if this is done. All of a
> sudden one can't pull arbitrary pieces of code out of PG and use them
> in other projects (which I'd argue is the intent if not the letter of
> the DFSG). Have we ever allowed code in on these terms before? Are
> we willing to be dropped from Debian and possibly Red Hat if this is
> the case?

Presumably a clean-room implementation of this algorithm would get us over these hurdles, if anyone wants to undertake it.

I certainly agree that we don't want arbitrary bits of our code to be encumbered or licensed differently from the rest.

cheers

andrew
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Stefan Kaltenbrunner
Date:
Andrew Dunstan wrote:
> Douglas McNaught wrote:
>> On Mon, Jan 5, 2009 at 3:18 AM, Stephen R. van den Berg <srb@cuci.nl> wrote:
>>> I'm not speaking for Lasse, merely providing food for thought, but it
>>> sounds feasible to me (and conforming to Lasse's spirit of his intended
>>> license) to put something like the following license on his code, which
>>> would allow inclusion into the PostgreSQL codebase and not restrict usage
>>> in any of the derived works:
>>>
>>> "Grant license to use the code in question without cost, provided that
>>> the code is being linked to at least 50% of the PostgreSQL code it is
>>> being distributed alongside with."
>>>
>>> This should allow commercial reuse in derived products without undesirable
>>> sideeffects.
>>
>> I think Postgres becomes non-DFSG-free if this is done. All of a
>> sudden one can't pull arbitrary pieces of code out of PG and use them
>> in other projects (which I'd argue is the intent if not the letter of
>> the DFSG). Have we ever allowed code in on these terms before? Are
>> we willing to be dropped from Debian and possibly Red Hat if this is
>> the case?
>
> Presumably a clean room implementation of this algorithm would get us
> over these hurdles, if anyone wants to undertake it.
>
> I certainly agree that we don't want arbitrary bits of our code to be
> encumbered or licensed differently from the rest.

Do we actually have any numbers showing that QuickLZ is actually faster and/or compresses better than what we have now?

Stefan
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
"Robert Haas"
Date:
> Are we willing to be dropped from Debian and possibly Red Hat if this
> is the case?

No. I frankly think this discussion is a dead end.

The whole thing got started because Alex Hunsaker pointed out that his database got a lot bigger because we disabled compression on columns > 1MB. It seems like the obvious thing to do is turn it back on again. The only objection to that is that it will hurt performance, especially on substring operations.

That led to a discussion of alternative compression algorithms, which is only relevant if we believe that there are people out there who want to do substring extractions on huge columns AND want those columns to be compressed. At least on this thread, we have zero requests for that feature combination.

What we do have is a suggestion from several people that the database shouldn't be in the business of compressing data AT ALL. If we want to implement that suggestion, then we could change the default column storage type.

Regardless of whether we do that or not, no one has offered any justification for the arbitrary decision not to compress columns >1MB, and at least one person (Peter) has suggested that it is exactly backwards. I think he's right, and this part should be backed out. That will leave us back in the sensible place where people who want compression can get it (which is currently false) and people who don't want it can get rid of it (which has always been true). If there is a demand for alternative compression algorithms, then someone can submit a patch for that for 8.5.

...Robert
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
"Stephen R. van den Berg"
Date:
Douglas McNaught wrote:
>> "Grant license to use the code in question without cost, provided that
>> the code is being linked to at least 50% of the PostgreSQL code it is
>> being distributed alongside with."
>>
>> This should allow commercial reuse in derived products without undesirable
>> sideeffects.
>
> I think Postgres becomes non-DFSG-free if this is done. All of a
> sudden one can't pull arbitrary pieces of code out of PG and use them
> in other projects (which I'd argue is the intent if not the letter of
> the DFSG). Have we ever allowed code in on these terms before? Are
> we willing to be dropped from Debian and possibly Red Hat if this is
> the case?

Upon reading the DFSG, it seems you have a point... However...
QuickLZ is dual licensed:
a. Royalty-free perpetual use as part of the PostgreSQL backend or any
   derived works of PostgreSQL which link in *at least* 50% of the
   original PostgreSQL codebase.
b. GPL if (a) does not apply for some reason.

I.e. for all intents and purposes, it fits the bill for both:
1. PostgreSQL-derived products (existing and future).
2. Debian/Red Hat, since the source can be used under the GPL.

In essence, it would be kind of a GPL license on steroids; it grants Berkeley-style rights as long as the source is part of PostgreSQL (or a derived work thereof), and it falls back to GPL if extracted.

--
Sincerely, Stephen R. van den Berg.
"Well, if we're going to make a party of it, let's nibble Nobby's nuts!"
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Tom Lane
Date:
"Douglas McNaught" <doug@mcnaught.org> writes:
> I think Postgres becomes non-DFSG-free if this is done. All of a
> sudden one can't pull arbitrary pieces of code out of PG and use them
> in other projects (which I'd argue is the intent if not the letter of
> the DFSG). Have we ever allowed code in on these terms before?

No, and we aren't starting now. Any submission that's not under a BSD-equivalent license will be rejected. Count on it.

regards, tom lane
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Andrew Chernow
Date:
Robert Haas wrote:
> What we do have is a suggestion from several people that the database
> shouldn't be in the business of compressing data AT ALL. If we want

+1

IMHO, this is a job for the application. I also think the current implementation is a little odd in that it only compresses data objects under a meg.

--
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
"A.M."
Date:
On Jan 5, 2009, at 1:16 PM, Stephen R. van den Berg wrote:
> Upon reading the DFSG, it seems you have a point...
> However...
> QuickLZ is dual licensed:
> a. Royalty-free perpetual use as part of the PostgreSQL backend or
>    any derived works of PostgreSQL which link in *at least* 50% of the
>    original PostgreSQL codebase.

How does one even define "50% of the original PostgreSQL codebase"? What nonsense.

-M
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Gregory Stark
Date:
"Robert Haas" <robertmhaas@gmail.com> writes:
> Regardless of whether we do that or not, no one has offered any
> justification of the arbitrary decision not to compress columns >1MB,

Er, yes, there was discussion before the change, for instance:

http://archives.postgresql.org/pgsql-hackers/2007-08/msg00082.php

And do you have any response to this point?

I think the right value for this setting is going to depend on the environment. If the system is starved for cpu cycles then you won't want to compress large data. If it's starved for i/o bandwidth but has spare cpu cycles then you will.

http://archives.postgresql.org/pgsql-hackers/2009-01/msg00074.php

> and at least one person (Peter) has suggested that it is exactly
> backwards. I think he's right, and this part should be backed out.

Well, the original code had a threshold above which we *always* compressed, even if it saved only a single byte.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Tom Lane
Date:
"Robert Haas" <robertmhaas@gmail.com> writes:
> The whole thing got started because Alex Hunsaker pointed out that his
> database got a lot bigger because we disabled compression on columns >
> 1MB. It seems like the obvious thing to do is turn it back on again.

I suggest that before we make any knee-jerk responses, we need to go back and reread the prior discussion. The current 8.4 code was proposed here:

http://archives.postgresql.org/pgsql-patches/2008-02/msg00053.php

and that message links to several older threads that were complaining about the 8.3 behavior. In particular the notion of an upper limit on what we should attempt to compress was discussed in this thread:

http://archives.postgresql.org/pgsql-general/2007-08/msg01129.php

After poking around in those threads a bit, I think that the current threshold of 1MB was something I just made up on the fly (I did note that it needed tuning...). Perhaps something like 10MB would be a better default. Another possibility is to have different minimum compression rates for "small" and "large" datums.

regards, tom lane
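[Editor's note: the two knobs Tom mentions — a size threshold and per-size minimum compression rates — can be sketched as a simple decision function. This is an illustrative sketch only; the names and the particular rates are hypothetical, not the actual pglz tuning parameters.]

```python
def keep_compressed(raw_len: int, compressed_len: int,
                    large_cutoff: int = 10 * 1024 * 1024,  # the suggested 10MB default
                    small_min_saving: float = 0.20,        # hypothetical rates
                    large_min_saving: float = 0.25) -> bool:
    """Decide whether the compressed copy of a datum is worth storing.

    Small and large datums get different minimum compression rates, as
    suggested above; a datum that doesn't shrink enough would be stored
    uncompressed instead.
    """
    if raw_len == 0:
        return False
    min_saving = small_min_saving if raw_len < large_cutoff else large_min_saving
    saving = 1.0 - compressed_len / raw_len
    return saving >= min_saving
```

Under this policy a 1kB datum compressing to 700 bytes (30% saving) is kept compressed, while a 20MB datum that only shrinks by 5% is not.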
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
"Robert Haas"
Date:
On Mon, Jan 5, 2009 at 2:02 PM, Gregory Stark <stark@enterprisedb.com> wrote:
>> Regardless of whether we do that or not, no one has offered any
>> justification of the arbitrary decision not to compress columns >1MB,
>
> Er, yes, there was discussion before the change, for instance:
>
> http://archives.postgresql.org/pgsql-hackers/2007-08/msg00082.php

OK, maybe I'm missing something, but I don't see anywhere in that email where it suggests NEVER compressing anything above 1MB. It suggests some more nuanced things which are quite different.

> And do you have any response to this point?
>
> I think the right value for this setting is going to depend on the
> environment. If the system is starved for cpu cycles then you won't want to
> compress large data. If it's starved for i/o bandwidth but has spare cpu
> cycles then you will.
>
> http://archives.postgresql.org/pgsql-hackers/2009-01/msg00074.php

I think it is a good point, to the extent that compression is an option that people choose in order to improve performance. I'm not really convinced that this is the case, but I haven't seen much evidence on either side of the question.

> Well the original code had a threshold above which we *always* compressed
> even if it saved only a single byte.

I certainly don't think that's right either.

...Robert
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
"Holger Hoffstaette"
Date:
On Mon, 05 Jan 2009 13:44:57 -0500, Andrew Chernow wrote:
> Robert Haas wrote:
>> What we do have is a suggestion from several people that the database
>> shouldn't be in the business of compressing data AT ALL. If we want

DB2 users generally seem very happy with the built-in compression.

> IMHO, this is a job for the application.

Changing applications is several times more expensive and often simply not possible.

-h
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
"Joshua D. Drake"
Date:
On Mon, 2009-01-05 at 13:04 -0500, Robert Haas wrote:
>> Are we willing to be dropped from Debian and possibly Red Hat if this
>> is the case?
>
> Regardless of whether we do that or not, no one has offered any
> justification of the arbitrary decision not to compress columns >1MB,
> and at least one person (Peter) has suggested that it is exactly
> backwards. I think he's right, and this part should be backed out.
> That will leave us back in the sensible place where people who want
> compression can get it (which is currently false) and people who don't
> want it can get rid of it (which has always been true). If there is a
> demand for alternative compression algorithms, then someone can submit
> a patch for that for 8.5.

+1

Sincerely,
Joshua D. Drake

--
PostgreSQL Consulting, Development, Support, Training
503-667-4564 - http://www.commandprompt.com/
The PostgreSQL Company, serving since 1997
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Mark Mielke
Date:
Guaranteed compression of large data fields is the responsibility of the client. The database should feel free to compress behind the scenes if it turns out to be desirable, but an expectation that it compresses is wrong in my opinion.

That said, I'm wondering why compression has to be a problem, or why >1 Mbyte is a reasonable compromise? I missed the original thread that led to 8.4. It seems to me that transparent file system compression doesn't have limits like "files must be less than 1 Mbyte to be compressed". They don't exhibit poor file system performance. I remember back in the 386/486 days that I would always DriveSpace compress everything, because hard disks were so slow then that DriveSpace would actually increase performance.

The toast tables already give a sort of block-addressable scheme. Compression can be on a per-block or per-set-of-blocks basis, allowing for seeking into the block; or if compression doesn't seem to be working for the first few blocks, the later blocks can be stored uncompressed? Or is that too complicated compared to what we have now? :-)

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
"Stephen R. van den Berg"
Date:
A.M. wrote:
> On Jan 5, 2009, at 1:16 PM, Stephen R. van den Berg wrote:
>> Upon reading the DFSG, it seems you have a point...
>> However...
>> QuickLZ is dual licensed:
>> a. Royalty-free perpetual use as part of the PostgreSQL backend or
>>    any derived works of PostgreSQL which link in *at least* 50% of the
>>    original PostgreSQL codebase.
>
> How does one even define "50% of the original PostgreSQL codebase"?
> What nonsense.

It's a suggested (but by no means definitive) technical translation of the legalese term "substantial". Substitute something better, by all means.

--
Sincerely, Stephen R. van den Berg.
"Well, if we're going to make a party of it, let's nibble Nobby's nuts!"
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Gregory Stark
Date:
Mark Mielke <mark@mark.mielke.cc> writes:
> It seems to me that transparent file system compression doesn't have limits
> like "files must be less than 1 Mbyte to be compressed". They don't exhibit
> poor file system performance.

Well, I imagine those implementations are more complex than toast is. I'm not sure what lessons we can learn from their behaviour directly.

> I remember back in the 386/486 days, that I would always DriveSpace compress
> everything, because hard disks were so slow then that DriveSpace would
> actually increase performance.

Surely this depends on whether your machine was cpu starved or disk starved? Do you happen to recall which camp these anecdotal machines fell in?

> The toast tables already give a sort of block-addressable scheme.
> Compression can be on a per block or per set of blocks basis allowing for
> seek into the block,

The current toast architecture is that we compress the whole datum, then store the datum either inline or using the same external blocking mechanism that we use when not compressing. So this doesn't fit at all.

It does seem like an interesting idea to have toast chunks which are compressed individually. So each chunk could be, say, an 8kb chunk of plaintext and stored as whatever size it ends up being after compression. That would allow us to do random access into external chunks as well as allow overlaying the cpu costs of decompression with the i/o costs. It would get a lower compression ratio than compressing the whole object together, but we would have to experiment to see how big a problem that was.

It would be pretty much rewriting the toast mechanism for external compressed data though. Currently the storage and the compression are handled separately. This would tie the two together in a separate code path.

Hm, it occurs to me we could almost use the existing code. Just store it as a regular uncompressed external datum but allow the toaster to operate on the data column (which it's normally not allowed to) to compress it, but not store it externally.

> or if compression doesn't seem to be working for the first few blocks, the
> later blocks can be stored uncompressed? Or is that too complicated compared
> to what we have now? :-)

Actually we do that now; it was part of the same patch we're discussing.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!
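[Editor's note: the per-chunk idea — compress each fixed-size plaintext chunk independently so a substring fetch only inflates the chunks it touches — can be sketched as follows. This is an illustrative Python model using zlib, not the actual TOAST code; the function names are invented.]

```python
import zlib

CHUNK = 8192  # plaintext chunk size, per the 8kb suggestion above

def compress_chunks(datum: bytes) -> list[bytes]:
    # Each chunk is deflated on its own, so chunk N can be decompressed
    # without first decompressing chunks 0..N-1 (unlike whole-datum
    # compression, where the stream must be read from the start).
    return [zlib.compress(datum[i:i + CHUNK])
            for i in range(0, len(datum), CHUNK)]

def substring(chunks: list[bytes], start: int, length: int) -> bytes:
    # Random access: decompress only the chunks overlapping the slice.
    first = start // CHUNK
    last = (start + length - 1) // CHUNK
    buf = b"".join(zlib.decompress(c) for c in chunks[first:last + 1])
    offset = start - first * CHUNK
    return buf[offset:offset + length]
```

The trade-off noted above shows up directly: each chunk carries its own compression state, so the overall ratio is worse than compressing the datum in one piece.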
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
"Stephen R. van den Berg"
Date:
Tom Lane wrote:
> "Robert Haas" <robertmhaas@gmail.com> writes:
>> The whole thing got started because Alex Hunsaker pointed out that his
>> database got a lot bigger because we disabled compression on columns >
>> 1MB. It seems like the obvious thing to do is turn it back on again.
>
> After poking around in those threads a bit, I think that the current
> threshold of 1MB was something I just made up on the fly (I did note
> that it needed tuning...). Perhaps something like 10MB would be a
> better default. Another possibility is to have different minimum
> compression rates for "small" and "large" datums.

As far as I can imagine, the following use cases apply:
a. Columnsize <= 2048 bytes without substring access.
b. Columnsize <= 2048 bytes with substring access.
c. Columnsize > 2048 bytes, compressible, without substring access (text).
d. Columnsize > 2048 bytes, uncompressible, with substring access (multimedia).

Can anyone think of another use case I missed here?

To cover those cases, the following solutions seem feasible:
Sa. Disable compression for this column (manually, by the DBA).
Sb. Check if the compression saves more than 20%, store uncompressed otherwise.
Sc. Check if the compression saves more than 20%, store uncompressed otherwise.
Sd. Check if the compression saves more than 20%, store uncompressed otherwise.

For Sb, Sc and Sd we should probably only check the first 256KB or so to determine the expected savings.

--
Sincerely, Stephen R. van den Berg.
"Well, if we're going to make a party of it, let's nibble Nobby's nuts!"
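[Editor's note: the "check the first 256KB" heuristic from Sb-Sd can be sketched like this. An illustrative stand-in using zlib, with the 20% figure taken from the message above; the names are invented, not PostgreSQL internals.]

```python
import zlib

SAMPLE_LIMIT = 256 * 1024  # probe only the first 256KB, per the suggestion
MIN_SAVING = 0.20          # store uncompressed unless we save more than 20%

def worth_compressing(datum: bytes) -> bool:
    """Estimate compressibility from a prefix instead of the whole datum.

    This bounds the CPU cost of the probe even for very large values,
    at the risk of mispredicting data whose compressibility changes
    after the sampled prefix.
    """
    sample = datum[:SAMPLE_LIMIT]
    if not sample:
        return False
    saving = 1.0 - len(zlib.compress(sample)) / len(sample)
    return saving > MIN_SAVING
```

For case (d) above (already-compressed multimedia), the probe would come back well under 20% and the datum would be stored as-is, which is the behaviour Sd asks for.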
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> Hm, it occurs to me we could almost use the existing code. Just store it as a
> regular uncompressed external datum but allow the toaster to operate on the
> data column (which it's normally not allowed to) to compress it, but not
> store it externally.

Yeah, it would be very easy to do that, but the issue then would be that instead of having a lot of toast-chunk rows that are all carefully made to fit exactly 4 to a page, you have a lot of toast-chunk rows of varying size, and you are certainly going to waste some disk space due to not being able to pack pages full. In the worst case you'd end up with zero benefit from compression anyway.

As an example, if all of your 2K chunks compress by just under 20%, you get no savings because you can't quite fit 5 to a page. You'd need an average compression rate of more than 20% to get savings. We could improve that figure by making the chunk size smaller, but that carries its own performance penalties (more seeks to fetch all of a toasted value). Also, the smaller the chunks, the worse the compression will get.

It's an interesting idea, and would be easy to try, so I hope someone does test it out and see what happens. But I'm not expecting miracles.

I think a more realistic approach would be the one somebody suggested upthread: split large values into, say, 1MB segments that are compressed separately and then stored to TOAST separately. Substring fetches then pay the overhead of decompressing 1MB segments that they might need only part of, but at least they're not pulling out the whole gosh-darn value. As long as the segment size isn't tiny, the added storage inefficiency should be pretty minimal.

(How we'd ever do upgrade-in-place to any new compression scheme is an interesting question too...)

regards, tom lane
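[Editor's note: the packing arithmetic above can be checked directly. A toast chunk is sized so 4 fit on an 8K page; a per-chunk saving just under 20% still leaves only 4 compressed chunks per page, so nothing is gained. This sketch ignores page and tuple header overhead, which only makes the break-even point worse.]

```python
PAGE = 8192        # PostgreSQL page size
CHUNK = PAGE // 4  # 2K toast chunks, exactly 4 per page uncompressed

def chunks_per_page(chunk_size: int) -> int:
    # How many equal-sized chunk rows fit on one page
    # (header overhead ignored for simplicity).
    return PAGE // chunk_size

# ~19% saving: 2048 -> 1659 bytes, still only 4 per page; zero benefit.
assert chunks_per_page(1659) == 4
# A hair over 20% saving: 2048 -> 1638 bytes, finally 5 per page.
assert chunks_per_page(1638) == 5
```

The same function makes the smaller-chunk trade-off visible: halving the chunk size lowers the break-even saving, but doubles the number of rows (and seeks) per toasted value.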
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Mark Mielke
Date:
Gregory Stark wrote:
> Mark Mielke <mark@mark.mielke.cc> writes:
>> It seems to me that transparent file system compression doesn't have limits
>> like "files must be less than 1 Mbyte to be compressed". They don't exhibit
>> poor file system performance.
>
> Well I imagine those implementations are more complex than toast is. I'm not
> sure what lessons we can learn from their behaviour directly.
>
>> I remember back in the 386/486 days, that I would always DriveSpace compress
>> everything, because hard disks were so slow then that DriveSpace would
>> actually increase performance.
>
> Surely this depends on whether your machine was cpu starved or disk starved?
> Do you happen to recall which camp these anecdotal machines fell in?

I agree. I'm sure it was disk I/O starved - and maybe not just the disk. The motherboard might have contributed. :-)

My production machine in 2008/2009 for my uses still seems I/O bound. The main database server I use is 2 x Intel Xeon 3.0 GHz (dual-core) = 4 cores, and the uptime load average for the whole system is currently 0.10. The database and web server use their own 4 drives with RAID 10 (main system is on two other drives). Yes, I could always upgrade to a fancy/larger RAID array, SAS, 15k RPM drives, etc., but if a PostgreSQL tweak were to give me 30% more performance at a 15% CPU cost... I think that would be a great alternative option. :-)

Memory may also play a part. My server at home has 4 Mbytes of L2 cache and 4 Gbytes of RAM running with 5-5-5-18 DDR2 at 1000 MHz. At these speeds, my realized bandwidth for RAM is 6.0+ Gbyte/s. My L1/L2 operate at 10.0+ Gbyte/s. Compression doesn't run that fast, so at least for me, the benefit of having something in L1/L2 cache vs RAM isn't great; however, my disks in the RAID 10 configuration only read/write at ~150 Mbyte/s sustained, and much less if seeking is required. Compressing the data means 30% more data may fit into RAM, or a 30% increase in data read from disk, as I assume many compression algorithms can beat 150 Mbyte/s.

Is my configuration typical? It's probably becoming more so. Certainly more common than the 10+ disk hardware RAID configurations.

> The current toast architecture is that we compress the whole datum, then
> store the datum either inline or using the same external blocking mechanism
> that we use when not compressing. So this doesn't fit at all.
>
> It does seem like an interesting idea to have toast chunks which are
> compressed individually. So each chunk could be, say, an 8kb chunk of
> plaintext and stored as whatever size it ends up being after compression.
> That would allow us to do random access into external chunks as well as
> allow overlaying the cpu costs of decompression with the i/o costs. It
> would get a lower compression ratio than compressing the whole object
> together but we would have to experiment to see how big a problem that was.
>
> It would be pretty much rewriting the toast mechanism for external
> compressed data though. Currently the storage and the compression are
> handled separately. This would tie the two together in a separate code path.
>
> Hm, it occurs to me we could almost use the existing code. Just store it as
> a regular uncompressed external datum but allow the toaster to operate on
> the data column (which it's normally not allowed to) to compress it, but
> not store it externally.

Yeah - sounds like it could be messy.

>> or if compression doesn't seem to be working for the first few blocks, the
>> later blocks can be stored uncompressed? Or is that too complicated
>> compared to what we have now? :-)
>
> Actually we do that now, it was part of the same patch we're discussing.

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Andrew Chernow
Date:
Holger Hoffstaette wrote: > On Mon, 05 Jan 2009 13:44:57 -0500, Andrew Chernow wrote: > >> Robert Haas wrote: >>> What we do have is a suggestion from several people that the database >>> shouldn't be in the business of compressing data AT ALL. If we want > > DB2 users generally seem very happy with the built-in compression. > >> IMHO, this is a job for the application. > > Changing applications is several times more expensive and often simply not > possible. > > The database can still handle all of the compression requirements if the "application" creates a couple of user-defined functions (probably in C) that utilize one of the many existing compression libraries (hand picked for their needs). You can use them in triggers to make it transparent. You can use them directly in statements. You can control selecting the data compressed or uncompressed, which is a valid use case for remote clients that have to download a large bytea or text. You can toggle compression algorithms and settings dependent on $whatever. You can do all of this right now w/o the built-in compression, which is my point: why have the built-in compression at all? Seems like a home-cut solution provides more features and control with minimal engineering. All the real engineering is done: the database and compression libraries. All that's left are a few glue functions in C. Well, my two pennies :) -- Andrew Chernow eSilo, LLC every bit counts http://www.esilo.com/
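[Editor's note: the thread proposes C user-defined functions plus triggers; the Python sketch below, with zlib and a hypothetical marker prefix, only illustrates the transparent compress-on-write / decompress-on-read wrapper being described.]

```python
import zlib

MAGIC = b"\x01CZ"  # hypothetical marker; a real UDF would use a proper header

def store_value(raw: bytes, min_saving: float = 0.10) -> bytes:
    """Compress-on-write: keep the raw bytes unless compression saves enough.
    (Simplification: raw data that happened to start with MAGIC would need escaping.)"""
    packed = MAGIC + zlib.compress(raw)
    return packed if len(packed) <= len(raw) * (1 - min_saving) else raw

def fetch_value(stored: bytes) -> bytes:
    """Decompress-on-read counterpart; passes uncompressed values through."""
    return zlib.decompress(stored[len(MAGIC):]) if stored.startswith(MAGIC) else stored
```

Wired into an INSERT/UPDATE trigger and a retrieval function, this gives per-column, per-algorithm control without touching the backend, which is the point being argued.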
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
"Robert Haas"
Date:
> I suggest that before we make any knee-jerk responses, we need to go > back and reread the prior discussion. > http://archives.postgresql.org/pgsql-patches/2008-02/msg00053.php > and that message links to several older threads that were complaining > about the 8.3 behavior. In particular the notion of an upper limit > on what we should attempt to compress was discussed in this thread: > http://archives.postgresql.org/pgsql-general/2007-08/msg01129.php Thanks for the pointers. > After poking around in those threads a bit, I think that the current > threshold of 1MB was something I just made up on the fly (I did note > that it needed tuning...). Perhaps something like 10MB would be a > better default. Another possibility is to have different minimum > compression rates for "small" and "large" datums. After reading these discussions, I guess I still don't understand why we would treat small and large datums differently. It seems to me that you had it about right here: http://archives.postgresql.org/pgsql-hackers/2007-08/msg00082.php # Or maybe it should just be a min_comp_rate and nothing else. # Compressing a 1GB field to 999MB is probably not very sane either. I agree with that. force_input_size doesn't seem like a good idea because compression can be useless on big datums just as it can be on little ones - the obvious case being media file formats that are already internally compressed. Even if you can squeeze a little more out, you're using a lot of CPU time for a very small gain in storage and/or I/O. Furthermore, on a large object, saving even 1MB is not very significant if the datum is 1GB in size - so, again, a percentage seems like the right thing. On the other hand, even after reading these threads, I still don't see any need to disable compression for large datums. I can't think of any reason why I would want to try compressing a 900kB object but not 1MB one. 
It makes sense to me to not compress if the object doesn't compress well, or if some initial segment of the object doesn't compress well (say, if we can't squeeze 10% out of the first 64kB), but size by itself doesn't seem significant. To put that another way, if small objects and large objects are to be treated differently, which one will we try harder to compress and why? Greg Stark makes an argument that we should try harder when it might avoid the need for a toast table: http://archives.postgresql.org/pgsql-hackers/2007-08/msg00087.php ...which has some merit, though clearly it would be a lot better if we could do it when, and only when, it was actually going to work. Also, not compressing very small datums (< 256 bytes) also seems smart, since that could end up producing a lot of extra compression attempts, most of which will end up saving little or no space. Apart from those two cases I don't see any clear motivation for discriminating on size. ...Robert
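[Editor's note: the policy being proposed — a percentage test plus a small-datum floor, with no upper size cap — can be stated compactly. In this hedged Python sketch, zlib stands in for pglz, and the 10%, 64 kB, and 256-byte numbers are the ones floated in the thread, not settled values.]

```python
import zlib

SAMPLE_BYTES = 64 * 1024   # probe only a prefix of large datums
MIN_SAVING = 0.10          # require at least 10% savings on the probe
MIN_DATUM = 256            # don't even attempt compression below this

def should_compress(datum: bytes) -> bool:
    """Size-agnostic policy: no upper limit; decide from a sampled ratio."""
    if len(datum) < MIN_DATUM:
        return False
    sample = datum[:SAMPLE_BYTES]
    return len(zlib.compress(sample)) <= len(sample) * (1 - MIN_SAVING)
```

Note how this naturally rejects already-compressed media files of any size (their prefix won't probe well) while still accepting a 1 GB text column, which is the behavior argued for above.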
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Bruce Momjian
Date:
Robert Haas wrote: > > After poking around in those threads a bit, I think that the current > > threshold of 1MB was something I just made up on the fly (I did note > > that it needed tuning...). Perhaps something like 10MB would be a > > better default. Another possibility is to have different minimum > > compression rates for "small" and "large" datums. > > After reading these discussions, I guess I still don't understand why > we would treat small and large datums differently. It seems to me > that you had it about right here: > > http://archives.postgresql.org/pgsql-hackers/2007-08/msg00082.php > > # Or maybe it should just be a min_comp_rate and nothing else. > # Compressing a 1GB field to 999MB is probably not very sane either. > > I agree with that. force_input_size doesn't seem like a good idea > because compression can be useless on big datums just as it can be on > little ones - the obvious case being media file formats that are > already internally compressed. Even if you can squeeze a little more > out, you're using a lot of CPU time for a very small gain in storage > and/or I/O. Furthermore, on a large object, saving even 1MB is not > very significant if the datum is 1GB in size - so, again, a percentage > seems like the right thing. > > On the other hand, even after reading these threads, I still don't see > any need to disable compression for large datums. I can't think of > any reason why I would want to try compressing a 900kB object but not > 1MB one. It makes sense to me to not compress if the object doesn't > compress well, or if some initial segment of the object doesn't > compress well (say, if we can't squeeze 10% out of the first 64kB), > but size by itself doesn't seem significant. > > To put that another way, if small objects and large objects are to be > treated differently, which one will we try harder to compress and why? 
> Greg Stark makes an argument that we should try harder when it might > avoid the need for a toast table: > > http://archives.postgresql.org/pgsql-hackers/2007-08/msg00087.php > > ...which has some merit, though clearly it would be a lot better if we > could do it when, and only when, it was actually going to work. Also, > not compressing very small datums (< 256 bytes) also seems smart, > since that could end up producing a lot of extra compression attempts, > most of which will end up saving little or no space. > > Apart from those two cases I don't see any clear motivation for > discriminating on size. Agreed. I have seen a lot of discussion on this topic and the majority seems to feel that a size limit on compression doesn't make sense in the general case. It is true that there is diminished performance for substring operations as TOAST values get longer, but compression does give better performance for longer values for full-field retrieval. I don't think we should be optimizing TOAST for substrings --- users who know they are going to be using substrings can specify the storage type for the column directly. Having any kind of maximum makes it hard for administrators to know exactly what is happening in TOAST. I think the upper limit should be removed, with a mention in the substring() documentation of the use of non-compressed TOAST storage. The only way I think an upper compression limit makes sense is if the backend can't uncompress the value to return it to the user, but then you have to wonder how the value got into the database in the first place. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Tom Lane
Date:
"Robert Haas" <robertmhaas@gmail.com> writes: > After reading these discussions, I guess I still don't understand why > we would treat small and large datums differently. It seems to me > that you had it about right here: > http://archives.postgresql.org/pgsql-hackers/2007-08/msg00082.php > # Or maybe it should just be a min_comp_rate and nothing else. > # Compressing a 1GB field to 999MB is probably not very sane either. Well, that's okay with me. I think that the other discussion was mainly focused on the silliness of compressing large datums when only a small percentage could be saved. What we might do for the moment is just to set the upper limit to INT_MAX in the default strategy, rather than rip out the logic altogether. IIRC that limit is checked only once per compression, not in the inner loop, so it won't cost us any noticeable performance to leave the logic there in case someone finds a use for it. > not compressing very small datums (< 256 bytes) also seems smart, > since that could end up producing a lot of extra compression attempts, > most of which will end up saving little or no space. But note that the current code will usually not try to do that anyway, at least for rows of ordinary numbers of columns. The present code has actually reduced the lower-bound threshold from where it used to be. I think that if anyone wants to argue for a different value, it'd be time to whip out some actual tests ;-). We can't set specific parameter values from gedanken-experiments. regards, tom lane
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Gregory Stark
Date:
> "Robert Haas" <robertmhaas@gmail.com> writes: > >> not compressing very small datums (< 256 bytes) also seems smart, >> since that could end up producing a lot of extra compression attempts, >> most of which will end up saving little or no space. That was presumably the rationale for the original logic. However experience shows that there are certainly databases that store a lot of compressible short strings. Obviously databases with CHAR(n) desperately need us to compress them. But even plain text data are often moderately compressible even with our fairly weak compression algorithm. One other thing that bothers me about our toast mechanism is that it only kicks in for tuples that are "too large". It seems weird that the same column is worth compressing or not depending on what other columns are in the same tuple. If you store a 2000 byte tuple that's all spaces we don't try to compress it at all. But if you added one more attribute we would go to great lengths compressing and storing attributes externally -- not necessarily the attribute you just added, the ones that were perfectly fine previously -- to try to get it under 2k. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
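[Editor's note: the 2000-bytes-of-spaces example is easy to quantify. In this sketch zlib stands in for the backend's weaker pglz compressor, so the exact byte counts are illustrative only.]

```python
import zlib

# A CHAR(2000)-style value: a little real text, then trailing pad spaces.
padded = b"abc" + b" " * 1997
packed = zlib.compress(padded)

# Run-dominated input collapses to a handful of bytes even with a
# general-purpose codec -- exactly the kind of tuple that today escapes
# compression entirely because it sits just under the toast threshold.
print(len(padded), "->", len(packed))
```

The asymmetry described above follows directly: whether this datum gets compressed currently depends not on its own (excellent) compressibility but on whether its siblings push the tuple over the threshold.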
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Lasse Reinhold
Date:
Stephen R. van den Berg wrote: > > I asked the author of the QuickLZ algorithm about licensing... > Sounds like he is willing to cooperate. This is what I got from him: > > On Sat, Jan 3, 2009 at 17:56, Lasse Reinhold <lar@quicklz.com> wrote: >> Hi Stephen, >> >> That sounds really exciting, I'd love to see QuickLZ included into >> PostgreSQL. I'd be glad to offer support and add custom optimizations, >> features or hacks or whatever should turn up. >> >> My only concern is to avoid undermining the commercial license, but this >> can, as you suggest, be solved by exceptionally allowing QuickLZ to be >> linked with PostgreSQL. Since I have exclusive copyright of QuickLZ any >> construction is possible. > Another solution could be to make PostgreSQL prepared for using compression with QuickLZ, letting the end user download QuickLZ separately and enable it with a compiler flag during compilation.
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
"Robert Haas"
Date:
>>> not compressing very small datums (< 256 bytes) also seems smart, >>> since that could end up producing a lot of extra compression attempts, >>> most of which will end up saving little or no space. > > That was presumably the rationale for the original logic. However experience > shows that there are certainly databases that store a lot of compressible > short strings. > > Obviously databases with CHAR(n) desperately need us to compress them. But > even plain text data are often moderately compressible even with our fairly > weak compression algorithm. > > One other thing that bothers me about our toast mechanism is that it only > kicks in for tuples that are "too large". It seems weird that the same column > is worth compressing or not depending on what other columns are in the same > tuple. That's a fair point. There's definitely some inconsistency in the current behavior. It seems to me that, in theory, compression and out-of-line storage are two separate behaviors. Out-of-line storage is pretty much a requirement for dealing with large objects, given that the page size is a constant; compression is not a requirement, but definitely beneficial under some circumstances, particularly when it removes the need for out-of-line storage. char(n) is kind of a weird case because you could also compress by storing a count of the trailing spaces, without applying a general-purpose compression algorithm. But either way the field is no longer fixed-width, and therefore field access can't be done as a simple byte offset from the start of the tuple. It's difficult even to enumerate the possible use cases, let alone what knobs would be needed to cater to all of them. ...Robert
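[Editor's note: the trailing-space scheme mentioned for char(n) needs no general-purpose codec at all. Function names below are illustrative; real storage would need a varlena-style header rather than a Python tuple.]

```python
def pack_char_n(value: str, n: int) -> tuple[str, int]:
    """Store the stripped text plus a count of stripped trailing pad spaces."""
    padded = value.ljust(n)              # char(n) semantics: space-padded to width n
    stripped = padded.rstrip(" ")
    return stripped, len(padded) - len(stripped)

def unpack_char_n(stripped: str, pad: int) -> str:
    """Reconstitute the full-width char(n) value."""
    return stripped + " " * pad
```

Either representation is variable-width, which is the point being made: once padding is stripped (or any compression applied), field access can no longer be a simple byte offset from the start of the tuple.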
Re: QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
From
Alvaro Herrera
Date:
Robert Haas escribió: > char(n) is kind of a weird case because you could also compress by > storing a count of the trailing spaces, without applying a > general-purpose compression algorithm. But either way the field is no > longer fixed-width, and therefore field access can't be done as a > simple byte offset from the start of the tuple. That's not the case anyway (fixed byte width), due to possible multibyte chars. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.