Thread: a faster compression algorithm for pg_dump

a faster compression algorithm for pg_dump

From
Joachim Wieland
Date:
I'd like to revive the discussion about offering a compression algorithm
other than zlib, at least for pg_dump. There was a previous discussion
here:

http://archives.postgresql.org/pgsql-performance/2009-08/msg00053.php

and it ended without any real conclusion. The findings so far were:

- There exist BSD-licensed compression algorithms
- Nobody knows a patent that is in our way
- Nobody can confirm that no patent is in our way

I do see a very real demand for replacing zlib, which compresses quite
well but is slow as hell. For pg_dump, what people want is cheap
compression: they usually prefer an algorithm that compresses less
optimally but is really fast.

One question that I do not yet see answered is, do we risk violating a
patent even if we just link against a compression library, for example
liblzf, without shipping the actual code?

I have checked what other projects do, especially regarding liblzf, which
would be my favorite choice (BSD license, available for quite some
time...), and there are other projects that actually ship the lzf code
(I haven't found a project that just links to it). The most prominent
projects are:

- KOffice (implements a derived version in
koffice-2.1.2/libs/store/KoXmlReader.cpp)
- Virtual Box (ships it in vbox-ose-1.3.8/src/libs/liblzf-1.51)
- TuxOnIce (formerly known as suspend2 - linux kernel patch, ships it
in the patch)

We have pg_lzcompress.c which implements the compression routines for
the tuple toaster. Are we sure that we don't violate any patents with
this algorithm?


Joachim


Re: a faster compression algorithm for pg_dump

From
Greg Stark
Date:
On Fri, Apr 9, 2010 at 12:17 AM, Joachim Wieland <joe@mcknight.de> wrote:
> One question that I do not yet see answered is, do we risk violating a
> patent even if we just link against a compression library, for example
> liblzf, without shipping the actual code?
>

Generally, patents are infringed when the patented process is used, so
whether we link against the code or ship it isn't really relevant. The
user using the software would need a patent license either way. We want
Postgres to be usable without being dependent on any copyright or patent
licenses.

Linking against it as an option isn't nearly as bad, since the user
compiling it can choose whether to include the restricted feature or
not. That's what we do with readline. However, it's not nearly as
attractive when it restricts what file formats Postgres supports -- it
means someone might generate backup dump files that they later
discover they don't have a legal right to read and restore :(

-- 
greg


Re: a faster compression algorithm for pg_dump

From
Joachim Wieland
Date:
On Fri, Apr 9, 2010 at 5:51 AM, Greg Stark <gsstark@mit.edu> wrote:
> Linking against it as an option isn't nearly as bad, since the user
> compiling it can choose whether to include the restricted feature or
> not. That's what we do with readline. However, it's not nearly as
> attractive when it restricts what file formats Postgres supports -- it
> means someone might generate backup dump files that they later
> discover they don't have a legal right to read and restore :(

If we only linked against it, we'd leave it up to the user to weigh
the risk as long as we are not aware of any such violation.

Our top priority is to make sure that the project would not be harmed
if one day such a patent showed up. If I understood you correctly,
this is not an issue even if we included lzf, and even less of an issue
if we only link against it. The rest is about user education: if we use
lzf only in pg_dump and not for toasting, we could show a message in
pg_dump when lzf is chosen to make the user aware of the possible
issues.

If we still cannot do this, then what I am asking is: What does the
project need to be able to at least link against such a compression
algorithm? Is it a list of 10, 20, 50, or more other projects using it,
or is it a lawyer saying "there is no patent"? But then, how can we be
sure that the lawyer is right? Or could we still not include it even if
we had both, because, again, we couldn't be sure... ?


Joachim


Re: a faster compression algorithm for pg_dump

From
Tom Lane
Date:
Joachim Wieland <joe@mcknight.de> writes:
> If we still cannot do this, then what I am asking is: What does the
> project need to be able to at least link against such a compression
> algorithm?

Well, what we *really* need is a convincing argument that it's worth
taking some risk for.  I find that not obvious.  You can pipe the output
of pg_dump into your-choice-of-compressor, for example, and that gets
you the ability to spread the work across multiple CPUs in addition to
eliminating legal risk to the PG project.  And in any case the general
impression seems to be that the main dump-speed bottleneck is on the
backend side not in pg_dump's compression.
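
For example (a minimal sketch; pigz is just one possible parallel
compressor here, and "mydb" is a placeholder database name):

    # dump in plain format and spread the compression across several cores
    pg_dump mydb | pigz -p 8 > mydb.sql.gz

    # restore by decompressing and feeding the SQL back to psql
    pigz -dc mydb.sql.gz | psql mydb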
        regards, tom lane


Re: a faster compression algorithm for pg_dump

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Joachim Wieland <joe@mcknight.de> writes:
>> If we still cannot do this, then what I am asking is: What does the
>> project need to be able to at least link against such a compression
>> algorithm?
> 
> Well, what we *really* need is a convincing argument that it's worth
> taking some risk for.  I find that not obvious.  You can pipe the output
> of pg_dump into your-choice-of-compressor, for example, and that gets
> you the ability to spread the work across multiple CPUs in addition to
> eliminating legal risk to the PG project.  And in any case the general
> impression seems to be that the main dump-speed bottleneck is on the
> backend side not in pg_dump's compression.

Legal risks aside (I'm not a lawyer, so I cannot comment on that), the
current situation imho is:

* for a plain pg_dump, the backend is the bottleneck
* for a pg_dump -Fc with compression, compression is a huge bottleneck
* for pg_dump | gzip, it is usually compression (or bytea and some other
datatypes in <9.0)
* for a parallel dump you can either dump uncompressed and compress
afterwards, which increases diskspace requirements (and if you need a
parallel dump you usually have a large database) and complexity (because
you would have to think about how to manually parallelize the
compression -- see the sketch below for one way to do that by hand)
* for a parallel dump that compresses inline, you are limited by the
compression algorithm on a per-core basis, and given that the current
inline compression overhead is huge you lose a lot of the benefits of
parallel dump
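
A rough sketch of that "compress afterwards" step, assuming the dump has
already been split into one file per table (paths and the number of
parallel jobs are placeholders):

    # compress each per-table dump file, running up to 8 gzip processes at once
    find /backups/mydb -name '*.dump' -print0 | xargs -0 -P 8 -n 1 gzip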


Stefan


Re: a faster compression algorithm for pg_dump

From
Dimitri Fontaine
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Well, what we *really* need is a convincing argument that it's worth
> taking some risk for.  I find that not obvious.  You can pipe the output
> of pg_dump into your-choice-of-compressor, for example, and that gets
> you the ability to spread the work across multiple CPUs in addition to
> eliminating legal risk to the PG project.

Well, I like -Fc and playing with the catalog to restore in staging
environments only the "interesting" data. I even automated all the
catalog mangling in pg_staging so that I just have to set up which
schema I want, with only the DDL or with the DATA too.

The fun is when you want to exclude functions that are used in triggers
based on the schema where the function lives, not the trigger, BTW, but
that's another story.
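
The basic catalog-filtering workflow, roughly (file and database names
are placeholders):

    # write the archive's table of contents to an editable list
    pg_restore -l mydb.dump > mydb.toc

    # comment out the entries you don't want, then restore using the edited list
    pg_restore -L mydb.toc -d staging mydb.dump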

So yes, having both -Fc and a compression facility other than plain gzip
would be good news. And benefiting from better compression in TOAST
would be good too, I guess (small size hit, lots faster, would fit).

Summary: my convincing argument is using the dumps to efficiently
prepare development and testing environments from production data,
thanks to -Fc. That includes skipping some data on restore.

Regards,
--
dim


Re: a faster compression algorithm for pg_dump

From
Bruce Momjian
Date:
Dimitri Fontaine wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
> > Well, what we *really* need is a convincing argument that it's worth
> > taking some risk for.  I find that not obvious.  You can pipe the output
> > of pg_dump into your-choice-of-compressor, for example, and that gets
> > you the ability to spread the work across multiple CPUs in addition to
> > eliminating legal risk to the PG project. 
> 
> Well, I like -Fc and playing with the catalog to restore in staging
> environments only the "interesting" data. I even automated all the
> catalog mangling in pg_staging so that I just have to set up which
> schema I want, with only the DDL or with the DATA too.
> 
>   The fun is when you want to exclude functions that are used in
>   triggers based on the schema where the function lives, not the
>   trigger, BTW, but that's another story.
> 
> So yes, having both -Fc and a compression facility other than plain gzip
> would be good news. And benefiting from better compression in TOAST
> would be good too, I guess (small size hit, lots faster, would fit).
> 
> Summary: my convincing argument is using the dumps to efficiently
> prepare development and testing environments from production data,
> thanks to -Fc. That includes skipping some data on restore.

I assume people realize that if they are using pg_dump -Fc and then
compressing the output later, they should turn off compression in
pg_dump -- or is that something we should document/suggest?
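
Something along these lines (a sketch; the database name is a placeholder
and lzop merely stands in for whatever fast external compressor is used):

    # -Z 0 turns off pg_dump's built-in zlib compression so the external
    # compressor doesn't waste time recompressing already-compressed data
    pg_dump -Fc -Z 0 mydb | lzop > mydb.dump.lzo

    # decompress before handing the archive to pg_restore
    lzop -dc mydb.dump.lzo > mydb.dump
    pg_restore -d mydb mydb.dump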

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com


Re: a faster compression algorithm for pg_dump

From
daveg
Date:
On Tue, Apr 13, 2010 at 03:03:58PM -0400, Tom Lane wrote:
> Joachim Wieland <joe@mcknight.de> writes:
> > If we still cannot do this, then what I am asking is: What does the
> > project need to be able to at least link against such a compression
> > algorithm?
> 
> Well, what we *really* need is a convincing argument that it's worth
> taking some risk for.  I find that not obvious.  You can pipe the output
> of pg_dump into your-choice-of-compressor, for example, and that gets
> you the ability to spread the work across multiple CPUs in addition to
> eliminating legal risk to the PG project.  And in any case the general
> impression seems to be that the main dump-speed bottleneck is on the
> backend side not in pg_dump's compression.

My client uses pg_dump -Fc and produces about 700GB of compressed
PostgreSQL dumps nightly from multiple hosts. They also depend on being
able to read and filter the dump catalog. A faster compression algorithm
would be a huge benefit for dealing with this volume.

-dg

-- 
David Gould       daveg@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.