Thread: alternative compression algorithms?

alternative compression algorithms?

From
Tomas Vondra
Date:
Hi there,

in the past we've repeatedly discussed the option of using a different
compression algorithm (e.g. lz4), but every time the discussion died off
because of fear of possible patent issues (see [1], [2] and many other
threads). Have we decided it's not worth the risks, making patches in
this area futile?

The reason why I'm asking about this is the multivariate statistics
patch - while optimizing the planning overhead, I realized that a
considerable amount of time is spent decompressing the statistics
(serialized as bytea), and using an algorithm with better decompression
performance (lz4 comes to mind) would help a lot. The statistics may be
a few tens/hundreds of kB, and in the planner every millisecond counts.

Would a differentiated approach work? That is, either adding an initdb
option allowing the user to choose an alternative compression algorithm
(and thus let him consider the possible patent issues), or using
different algorithms for different pieces of data (e.g. keep pglz for
the user data, and lz4 for statistics).

The first option is quite trivial to implement - I already have an
experimental patch implementing that (attached, but a bit dirty). The
second option is probably more difficult (we'd have to teach tuple
toaster about multiple compression algorithms and pass that information
somehow). Also, I'm not sure it'd make the patent concerns go away ...
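
To make that first option concrete, here is a minimal standalone sketch of
the general shape: a cluster-wide method chosen once (as initdb would do)
and a dispatch table consulted by the compress/decompress wrappers. All of
the names here are invented for illustration and the "compressors" are
byte-copying stubs, so this is only the idea, not the attached patch:

/*
 * Minimal standalone sketch of the "initdb option" idea: the cluster picks
 * one compression method up front and every compress/decompress call
 * dispatches through a table.  CompressionMethod, CompressionRoutine,
 * cluster_compression_method etc. are hypothetical names, and the
 * "compressors" are byte-copying stubs standing in for pglz/lz4.
 */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

typedef enum CompressionMethod
{
    COMPRESSION_PGLZ = 0,
    COMPRESSION_LZ4 = 1
} CompressionMethod;

typedef struct CompressionRoutine
{
    const char *name;
    /* returns compressed size */
    int32_t     (*compress) (const char *src, int32_t slen, char *dst);
    /* returns decompressed size */
    int32_t     (*decompress) (const char *src, int32_t slen, char *dst,
                               int32_t rawlen);
} CompressionRoutine;

/* stub "compressors" that just copy bytes */
static int32_t
copy_compress(const char *src, int32_t slen, char *dst)
{
    memcpy(dst, src, slen);
    return slen;
}

static int32_t
copy_decompress(const char *src, int32_t slen, char *dst, int32_t rawlen)
{
    (void) rawlen;              /* a real decompressor would check this */
    memcpy(dst, src, slen);
    return slen;
}

static const CompressionRoutine routines[] = {
    {"pglz", copy_compress, copy_decompress},
    {"lz4", copy_compress, copy_decompress}
};

/* chosen once at initdb time and (hypothetically) recorded in pg_control */
static CompressionMethod cluster_compression_method = COMPRESSION_LZ4;

int
main(void)
{
    const char *input = "statistics payload";
    char        compressed[64];
    char        restored[64];
    const CompressionRoutine *r = &routines[cluster_compression_method];
    int32_t     clen = r->compress(input, (int32_t) strlen(input) + 1, compressed);
    int32_t     rlen = r->decompress(compressed, clen, restored, clen);

    printf("method=%s compressed=%d restored=%d text=%s\n",
           r->name, clen, rlen, restored);
    return 0;
}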

I'm a bit confused though, because I've noticed various other FOSS
projects adopting lz4 over the past few years and I'm yet to find a
project voicing the same concerns about patents. So either they're
reckless or we're excessively paranoid.

Also, lz4 is not the only compression algorithm available - I've done a
bunch of tests with lz4, lz4hc, lzo and snappy, and lzo actually
performed better than lz4 (not claiming that's a universal truth). But I
suppose that the patent concerns are not specific to lz4 but apply to
compression in general.


[1] http://www.postgresql.org/message-id/50EA7976.5060809@lab.ntt.co.jp
[2]
http://www.postgresql.org/message-id/20130614230142.GC19641@awork2.anarazel.de

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: alternative compression algorithms?

From
Michael Paquier
Date:
On Mon, Apr 20, 2015 at 5:51 AM, Tomas Vondra wrote:
> I'm a bit confused though, because I've noticed various other FOSS projects
> adopting lz4 over the past few years and I'm yet to find a project voicing
> the same concerns about patents. So either they're reckless or we're
> excessively paranoid.

Both are true, though being very careful regarding software patents
usually pays off. Strange things happen in the legal world.

> Also, lz4 is not the only compression algorithm available - I've done a
> bunch of tests with lz4, lz4hc, lzo and snappy, and lzo actually performed
> better than lz4 (not claiming that's a universal truth). But I suppose that
> the patent concerns are not specific to lz4 but apply to compression in
> general.

Some proprietary algorithms may perform even better and faster, who
knows for sure? And even if lzo or lz4 are better for now, we may later
find something better still. In short, I still tend to think that the
right answer would be to have a dedicated set of hooks and let people
play with the compression algorithms they want (the same statement
stands for FPW compression as well, but that's separate from the
multivariate statistics patch).
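
A rough sketch of what such a hook could look like, following the usual
PostgreSQL hook pattern (a global function pointer that an extension sets
from its _PG_init). The hook name, signature and the byte-copying
placeholders below are invented for illustration; it's a standalone toy,
not a concrete proposal:

#include <stdio.h>
#include <string.h>

typedef int (*compress_datum_hook_type) (const char *src, int slen,
                                          char *dst, int dstlen);

/* NULL means "use the built-in pglz path" */
static compress_datum_hook_type compress_datum_hook = NULL;

/* placeholder for the existing pglz_compress() call */
static int
builtin_compress(const char *src, int slen, char *dst, int dstlen)
{
    if (slen > dstlen)
        return -1;
    memcpy(dst, src, slen);
    return slen;
}

/* what core would call wherever it compresses a datum today */
static int
compress_datum(const char *src, int slen, char *dst, int dstlen)
{
    if (compress_datum_hook != NULL)
        return compress_datum_hook(src, slen, dst, dstlen);
    return builtin_compress(src, slen, dst, dstlen);
}

/* what an extension might install from its _PG_init() */
static int
my_lz4_compress(const char *src, int slen, char *dst, int dstlen)
{
    /* placeholder for an LZ4_compress_default() call */
    if (slen > dstlen)
        return -1;
    memcpy(dst, src, slen);
    return slen;
}

int
main(void)
{
    char        buf[64];
    int         n;

    compress_datum_hook = my_lz4_compress;      /* the "extension" takes over */
    n = compress_datum("hello", 6, buf, (int) sizeof(buf));
    printf("compressed %d bytes via hook\n", n);
    return 0;
}

The obvious wrinkle for stored data is that decompression later has to
know which algorithm produced the bytes, so a hook alone is not enough
for on-disk values.
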
Regards,
-- 
Michael



Re: alternative compression algorithms?

From
Stephen Frost
Date:
* Michael Paquier (michael.paquier@gmail.com) wrote:
> On Mon, Apr 20, 2015 at 5:51 AM, Tomas Vondra wrote:
> > I'm a bit confused though, because I've noticed various other FOSS projects
> > adopting lz4 over the past few years and I'm yet to find a project voicing
> > the same concerns about patents. So either they're reckless or we're
> > excessively paranoid.
>
> Both are true, though being very careful regarding software patents
> usually pays off. Strange things happen in the legal world.
>
> > Also, lz4 is not the only compression algorithm available - I've done a
> > bunch of tests with lz4, lz4hc, lzo and snappy, and lzo actually performed
> > better than lz4 (not claiming that's a universal truth). But I suppose that
> > the patent concerns are not specific to lz4 but apply to compression in
> > general.
>
> Some proprietary algorithms may perform even better and faster, who
> knows for sure? And even if lzo or lz4 are better for now, we may later
> find something better still. In short, I still tend to think that the
> right answer would be to have a dedicated set of hooks and let people
> play with the compression algorithms they want (the same statement
> stands for FPW compression as well, but that's separate from the
> multivariate statistics patch).

I'm no lawyer, but I understand that one useful defense against patent
infringement lawsuits is to stop infringing on the patent (e.g. stop doing
whatever it is the patent is about).  Therefore, the best approach would
be exactly this: provide a way to swap out the compression algorithm.
One issue we'll need to consider is how to do that for stored data, as
you'd need to recompress everything.  What that basically means is we'll
need to burn bits to indicate which compression algorithm is used.
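
As a toy illustration of the "burn bits" point, the sketch below packs two
method bits and a 30-bit raw size into a single uint32. The layout and
names are invented for the example (it is not the actual TOAST header
format); the point is just that old pglz data stays readable while new
data can carry a different method id:

#include <stdio.h>
#include <stdint.h>

#define METHOD_BITS   2
#define SIZE_BITS     (32 - METHOD_BITS)
#define SIZE_MASK     ((1U << SIZE_BITS) - 1)

enum compression_method
{
    METHOD_PGLZ = 0,
    METHOD_LZ4 = 1
    /* two spare values left for future algorithms */
};

static uint32_t
make_header(enum compression_method method, uint32_t rawsize)
{
    return ((uint32_t) method << SIZE_BITS) | (rawsize & SIZE_MASK);
}

static enum compression_method
header_method(uint32_t header)
{
    return (enum compression_method) (header >> SIZE_BITS);
}

static uint32_t
header_rawsize(uint32_t header)
{
    return header & SIZE_MASK;
}

int
main(void)
{
    uint32_t    hdr = make_header(METHOD_LZ4, 123456);

    /* decompression picks its routine based on the stored method bits */
    printf("method=%d rawsize=%u\n", header_method(hdr), header_rawsize(hdr));
    return 0;
}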

For my 2c, I suspect that's worth it - I expect the gains to outweigh
those extra bits in the general case, though I'm sure there will be cases
where it's a loss.
Thanks,
    Stephen

Re: alternative compression algorithms?

From
Andres Freund
Date:
Hi,

On 2015-04-19 22:51:53 +0200, Tomas Vondra wrote:
> The reason why I'm asking about this is the multivariate statistics patch -
> while optimizing the planning overhead, I realized that a considerable amount
> of time is spent decompressing the statistics (serialized as bytea), and
> using an algorithm with better decompression performance (lz4 comes to mind)
> would help a lot. The statistics may be a few tens/hundreds of kB, and in the
> planner every millisecond counts.

I've a hard time believing that a different compression algorithm is
going to be the solution for that. Decompressing, quite possibly
repeatedly, several hundred kB during planning isn't going to be
fun. Even with a good, fast compression algorithm.

But independent of this, it'd be good to make progress on it.

> Would a differentiated approach work? That is, either adding an initdb
> option allowing the user to choose an alternative compression algorithm (and
> thus let him consider the possible patent issues), or using different
> algorithms for different pieces of data (e.g. keep pglz for the user data,
> and lz4 for statistics).
> 
> The first option is quite trivial to implement - I already have an
> experimental patch implementing that (attached, but a bit dirty). The second
> option is probably more difficult (we'd have to teach tuple toaster about
> multiple compression algorithms and pass that information somehow). Also,
> I'm not sure it'd make the patent concerns go away ...

Note that I had implemented the "second option" somewhere down the line
in the thread [2] you refer to.

> [2] http://www.postgresql.org/message-id/20130614230142.GC19641@awork2.anarazel.de

Greetings,

Andres Freund



Re: alternative compression algorithms?

From
Heikki Linnakangas
Date:
On 04/19/2015 11:51 PM, Tomas Vondra wrote:
> Hi there,
>
> in the past we've repeatedly discussed the option of using a different
> compression algorithm (e.g. lz4), but every time the discussion died off
> because of fear of possible patent issues (see [1], [2] and many other
> threads). Have we decided it's not worth the risks, making patches in
> this area futile?
> ...
> I'm a bit confused though, because I've noticed various other FOSS
> projects adopting lz4 over the past few years and I'm yet to find a
> project voicing the same concerns about patents. So either they're
> reckless or we're excessively paranoid.

IMHO we should switch to a different algorithm. Wrt patents, if we 
choose an algorithm that's already used widely in several other open 
source projects, that's safe enough. It's as safe as we're going to get.

There is always the chance that you infringe on a patent when you write 
any code. Compression is a particularly heavily-patented field, but it's 
nevertheless a bit strange to be super-paranoid there, and not worry 
about patents elsewhere. (The safest option would be to pick an 
algorithm that was patented, but the patent has expired. I don't think 
any of the compression algorithms discussed recently are old enough for 
that, though.)

- Heikki




Re: alternative compression algorithms?

From
Tomas Vondra
Date:

On 04/20/15 05:07, Andres Freund wrote:
> Hi,
>
> On 2015-04-19 22:51:53 +0200, Tomas Vondra wrote:
>> The reason why I'm asking about this is the multivariate statistics patch -
>> while optimizing the planning overhead, I realized that a considerable amount
>> of time is spent decompressing the statistics (serialized as bytea), and
>> using an algorithm with better decompression performance (lz4 comes to mind)
>> would help a lot. The statistics may be a few tens/hundreds of kB, and in the
>> planner every millisecond counts.
>
> I've a hard time believing that a different compression algorithm is
> going to be the solution for that. Decompressing, quite possibly
> repeatedly, several hundred kB during planning isn't going to be
> fun. Even with a good, fast compression algorithm.

Sure, it's not an ultimate solution, but it might help a bit. I do have
other ideas on how to optimize this, but in the planner every millisecond
counts. Looking at 'perf top', I see pglz_decompress() in the top 3.

regards
Tomas



Re: alternative compression algorithms?

From
Robert Haas
Date:
On Mon, Apr 20, 2015 at 9:03 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Sure, it's not an ultimate solution, but it might help a bit. I do have
> other ideas on how to optimize this, but in the planner every millisecond
> counts. Looking at 'perf top', I see pglz_decompress() in the top 3.

I suggested years ago that we should not compress data in
pg_statistic.  Tom shot that down, but I don't understand why.  It
seems to me that when we know data is extremely frequently accessed,
storing it uncompressed makes sense.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: alternative compression algorithms?

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Apr 20, 2015 at 9:03 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Sure, it's not an ultimate solution, but it might help a bit. I do have
>> other ideas on how to optimize this, but in the planner every millisecond
>> counts. Looking at 'perf top', I see pglz_decompress() in the top 3.

> I suggested years ago that we should not compress data in
> pg_statistic.  Tom shot that down, but I don't understand why.  It
> seems to me that when we know data is extremely frequently accessed,
> storing it uncompressed makes sense.

I've not been following this thread, but I do not think your argument here
holds any water.  pg_statistic entries are generally fetched via the
syscaches, and we fixed things years ago so that toasted tuple entries
are detoasted before insertion in syscache.  So I don't believe that
preventing on-disk compression would make for any significant improvement,
at least not after the first reference within a session.

Also, it's a very long way from "some pg_statistic entries are frequently
accessed" to "all pg_statistic entries are frequently accessed".
        regards, tom lane



Re: alternative compression algorithms?

From
Tomas Vondra
Date:
Hi,

On 04/29/15 23:54, Robert Haas wrote:
> On Mon, Apr 20, 2015 at 9:03 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Sure, it's not an ultimate solution, but it might help a bit. I do have
>> other ideas on how to optimize this, but in the planner every millisecond
>> counts. Looking at 'perf top', I see pglz_decompress() in the top 3.
>
> I suggested years ago that we should not compress data in
> pg_statistic.  Tom shot that down, but I don't understand why.  It
> seems to me that when we know data is extremely frequently accessed,
> storing it uncompressed makes sense.

I'm not convinced that not compressing the data is a good idea - I suspect
it would only move the time to TOAST and increase memory pressure (both in
general and in shared buffers). But I think that using a more efficient
compression algorithm would help a lot.

For example, when profiling the multivariate stats patch (with multiple
quite large histograms), pglz_decompress is #1 in the profile, occupying
more than 30% of the time. After replacing it with lz4, the data are a bit
larger, but it drops to ~0.25% in the profile and planning time drops
proportionally.

It's not a silver bullet, but it would help a lot in those cases.
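
For anyone who wants a rough standalone feel for lz4 decompression speed,
a micro-benchmark along these lines works (this is not the harness used
for the numbers above, and pglz is left out because it isn't a standalone
library; build with something like "cc bench_lz4.c -llz4"):

/*
 * Rough micro-benchmark of lz4 decompression on a payload of roughly the
 * size of the statistics discussed above.  Illustrative only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <lz4.h>

#define RAW_SIZE   (256 * 1024)   /* "a few hundred kB of statistics" */
#define LOOPS      10000

int
main(void)
{
    char       *raw = malloc(RAW_SIZE);
    char       *compressed = malloc(LZ4_compressBound(RAW_SIZE));
    char       *restored = malloc(RAW_SIZE);
    int         clen;
    clock_t     start;
    double      elapsed;

    /* mildly repetitive input so it actually compresses */
    for (int i = 0; i < RAW_SIZE; i++)
        raw[i] = (char) ((i / 64) % 251);

    clen = LZ4_compress_default(raw, compressed, RAW_SIZE,
                                LZ4_compressBound(RAW_SIZE));
    printf("compressed %d -> %d bytes\n", RAW_SIZE, clen);

    start = clock();
    for (int i = 0; i < LOOPS; i++)
    {
        if (LZ4_decompress_safe(compressed, restored, clen, RAW_SIZE) != RAW_SIZE)
        {
            fprintf(stderr, "decompression failed\n");
            return 1;
        }
    }
    elapsed = (double) (clock() - start) / CLOCKS_PER_SEC;

    printf("%d decompressions in %.2f s (%.1f MB/s)\n",
           LOOPS, elapsed,
           (double) RAW_SIZE * LOOPS / (1024 * 1024) / elapsed);

    free(raw);
    free(compressed);
    free(restored);
    return 0;
}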


--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: alternative compression algorithms?

From
Petr Jelinek
Date:
On 30/04/15 00:44, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Mon, Apr 20, 2015 at 9:03 AM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>> Sure, it's not an ultimate solution, but it might help a bit. I do have
>>> other ideas on how to optimize this, but in the planner every millisecond
>>> counts. Looking at 'perf top', I see pglz_decompress() in the top 3.
>
>> I suggested years ago that we should not compress data in
>> pg_statistic.  Tom shot that down, but I don't understand why.  It
>> seems to me that when we know data is extremely frequently accessed,
>> storing it uncompressed makes sense.
>
> I've not been following this thread, but I do not think your argument here
> holds any water.  pg_statistic entries are generally fetched via the
> syscaches, and we fixed things years ago so that toasted tuple entries
> are detoasted before insertion in syscache.  So I don't believe that
> preventing on-disk compression would make for any significant improvement,
> at least not after the first reference within a session.

AFAICS the syscache tuples are detoasted but not decompressed before
insertion into the syscache (catcache calls toast_flatten_tuple, which
doesn't do decompression), so pglz_decompress is still called every time.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: alternative compression algorithms?

From
Robert Haas
Date:
On Wed, Apr 29, 2015 at 6:55 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I'm not convinced that not compressing the data is a good idea - I suspect
> it would only move the time to TOAST and increase memory pressure (both in
> general and in shared buffers). But I think that using a more efficient
> compression algorithm would help a lot.
>
> For example, when profiling the multivariate stats patch (with multiple
> quite large histograms), pglz_decompress is #1 in the profile, occupying
> more than 30% of the time. After replacing it with lz4, the data are a bit
> larger, but it drops to ~0.25% in the profile and planning time drops
> proportionally.

That seems to imply a >100x improvement in decompression speed.  Really???

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: alternative compression algorithms?

From
Tomas Vondra
Date:

On 04/30/15 02:42, Robert Haas wrote:
> On Wed, Apr 29, 2015 at 6:55 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> I'm not convinced that not compressing the data is a good idea - I suspect
>> it would only move the time to TOAST and increase memory pressure (both in
>> general and in shared buffers). But I think that using a more efficient
>> compression algorithm would help a lot.
>>
>> For example, when profiling the multivariate stats patch (with multiple
>> quite large histograms), pglz_decompress is #1 in the profile, occupying
>> more than 30% of the time. After replacing it with lz4, the data are a bit
>> larger, but it drops to ~0.25% in the profile and planning time drops
>> proportionally.
>
> That seems to imply a >100x improvement in decompression speed.  Really???

Sorry, that was a bit of a misleading over-statement. The profiles (same
dataset, same workload) look like this:


pglz
----
  44.51%  postgres      [.] pglz_decompress
  13.60%  postgres      [.] update_match_bitmap_histogram
   8.40%  postgres      [.] float8_cmp_internal
   7.43%  postgres      [.] float8lt
   6.49%  postgres      [.] deserialize_mv_histogram
   6.23%  postgres      [.] FunctionCall2Coll
   4.06%  postgres      [.] DatumGetFloat8
   3.48%  libc-2.18.so  [.] __isnan
   1.26%  postgres      [.] clauselist_mv_selectivity
   1.09%  libc-2.18.so  [.] __memcpy_sse2_unaligned

lz4
---
  18.05%  postgres          [.] update_match_bitmap_histogram
  11.67%  postgres          [.] float8_cmp_internal
  10.53%  postgres          [.] float8lt
   8.67%  postgres          [.] FunctionCall2Coll
   8.52%  postgres          [.] deserialize_mv_histogram
   5.52%  postgres          [.] DatumGetFloat8
   4.90%  libc-2.18.so      [.] __isnan
   3.92%  liblz4.so.1.6.0   [.] 0x0000000000002603
   2.08%  liblz4.so.1.6.0   [.] 0x0000000000002847
   1.81%  postgres          [.] clauselist_mv_selectivity
   1.47%  libc-2.18.so      [.] __memcpy_sse2_unaligned
   1.33%  liblz4.so.1.6.0   [.] 0x000000000000260f
   1.16%  liblz4.so.1.6.0   [.] 0x00000000000025e3
  (and then a long tail of other lz4 calls)


The difference used to be more significant, but I've made a lot of
improvements in the update_match_bitmap method (so the lz4 calls are now
relatively more prominent in the profile).

The whole script (doing a lot of estimates) takes 1:50 with pglz and 
only 1:25 with lz4. That's ~25-30% improvement.

The results are slightly unreliable because they were collected in a Xen
VM, and the overhead is non-negligible (but the same in both cases). I
wouldn't be surprised if the difference was more significant without the
VM.

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: alternative compression algorithms?

From
Robert Haas
Date:
On Wed, Apr 29, 2015 at 9:12 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> The whole script (doing a lot of estimates) takes 1:50 with pglz and only
> 1:25 with lz4. That's ~25-30% improvement.

Still pretty good....

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company