Re: Add LZ4 compression in pg_dump - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Add LZ4 compression in pg_dump
Msg-id 09b37949-cfe5-29bb-cbb1-498ee5700b61@enterprisedb.com
In response to Re: Add LZ4 compression in pg_dump  (Justin Pryzby <pryzby@telsasoft.com>)
Responses Re: Add LZ4 compression in pg_dump
List pgsql-hackers

On 2/27/23 05:49, Justin Pryzby wrote:
> On Sat, Feb 25, 2023 at 08:05:53AM -0600, Justin Pryzby wrote:
>> On Fri, Feb 24, 2023 at 11:02:14PM -0600, Justin Pryzby wrote:
>>> I have some fixes (attached) and questions while polishing the patch for
>>> zstd compression.  The fixes are small and could be integrated with the
>>> patch for zstd, but could be applied independently.
>>
>> One more - WriteDataToArchiveGzip() says:
> 
> One more again.
> 
> The LZ4 path is using non-streaming mode, which compresses each block
> without persistent state, giving poor compression for -Fc compared with
> -Fp.  If the data is highly compressible, the difference can be orders
> of magnitude.
> 
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fp |wc -c
> 12351763
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fc |wc -c
> 21890708
> 
> That's not true for gzip:
> 
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z gzip -Fc |wc -c
> 2118869
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z gzip -Fp |wc -c
> 2115832
> 
> The function ought to at least use streaming mode, so each block/row
> isn't compressed in isolation.  003 is a simple patch to use
> streaming mode, which improves the -Fc case:
> 
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fc |wc -c
> 15178283
> 
> However, that still flushes the compression buffer, writing a block
> header, for every row.  With a single-column table, pg_dump -Fc -Z lz4
> still outputs ~10% *more* data than with no compression at all.  And
> that's for compressible data.
> 
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Fc -Z lz4 |wc -c
> 12890296
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Fc -Z none |wc -c
> 11890296
> 
> I think this should use the LZ4F API with frames, which are buffered to
> avoid outputting a header for every single row.  The LZ4F format isn't
> compatible with the LZ4 format, so (unlike changing to the streaming
> API) that's not something we can change in a bugfix release.  I consider
> this an Open Item.
> 
> With the LZ4F API in 004, -Fp and -Fc are essentially the same size
> (like gzip).  (Oh, and the output is three times smaller, too.)
> 
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z lz4 -Fp |wc -c
> 4155448
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z lz4 -Fc |wc -c
> 4156548
> 
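
For illustration only (this is not part of the patches discussed above): the
three behaviours described in the quoted message can be reproduced in
miniature with Python's standard zlib module. This is a rough sketch under
stated assumptions. zlib stands in for LZ4 because the standard library has
no LZ4 binding, the row contents are made up, and the function names are
hypothetical. The absolute sizes will not match the pg_dump numbers; only the
ordering between the three modes is the point.

```python
import zlib


def make_rows(n=2000):
    # Hypothetical, highly compressible single-column rows (one line per row).
    return [f"row {i % 10} {'a' * 32}\n".encode() for i in range(n)]


def per_row_independent(rows):
    # Analogue of non-streaming mode: every row is compressed as its own
    # complete stream, with no state shared between rows.
    return sum(len(zlib.compress(r)) for r in rows)


def streaming_flush_each_row(rows):
    # Analogue of streaming mode with a flush per row: earlier rows can be
    # referenced, but each flush still emits a block boundary.
    c = zlib.compressobj()
    total = 0
    for r in rows:
        total += len(c.compress(r))
        total += len(c.flush(zlib.Z_SYNC_FLUSH))
    return total + len(c.flush())


def streaming_buffered(rows):
    # Analogue of buffered, frame-style output: the compressor decides when
    # to emit data, and there is a single flush at the very end.
    c = zlib.compressobj()
    total = sum(len(c.compress(r)) for r in rows)
    return total + len(c.flush())


if __name__ == "__main__":
    rows = make_rows()
    print("raw bytes:           ", sum(len(r) for r in rows))
    print("independent per row: ", per_row_independent(rows))
    print("stream, flush per row:", streaming_flush_each_row(rows))
    print("stream, buffered:     ", streaming_buffered(rows))
```

With very short rows (a few bytes each), the per-row flush overhead in the
second variant can exceed the size of the row itself, which is the same kind
of effect as compressed output ending up larger than uncompressed output.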

Thanks. Those are definitely interesting improvements/optimizations!

I suggest we track them as a separate patch series - please add them to
the CF app (I guess you'll have to add them to 2023-07 at this point,
but we can get them in, I think).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


