Re: Add LZ4 compression in pg_dump - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Add LZ4 compression in pg_dump
Date
Msg-id 2f680c4f-0944-5fd0-2e39-8cd95ccb89f5@enterprisedb.com
In response to Re: Add LZ4 compression in pg_dump  (Justin Pryzby <pryzby@telsasoft.com>)
Responses Re: Add LZ4 compression in pg_dump  (Justin Pryzby <pryzby@telsasoft.com>)
List pgsql-hackers

On 3/9/23 17:15, Justin Pryzby wrote:
> On Wed, Mar 01, 2023 at 05:39:54PM +0100, Tomas Vondra wrote:
>> On 2/27/23 05:49, Justin Pryzby wrote:
>>> On Sat, Feb 25, 2023 at 08:05:53AM -0600, Justin Pryzby wrote:
>>>> On Fri, Feb 24, 2023 at 11:02:14PM -0600, Justin Pryzby wrote:
>>>>> I have some fixes (attached) and questions while polishing the patch for
>>>>> zstd compression.  The fixes are small and could be integrated with the
>>>>> patch for zstd, but could be applied independently.
>>>>
>>>> One more - WriteDataToArchiveGzip() says:
>>>
>>> One more again.
>>>
>>> The LZ4 path is using non-streaming mode, which compresses each block
>>> without persistent state, giving poor compression for -Fc compared with
>>> -Fp.  If the data is highly compressible, the difference can be orders
>>> of magnitude.
>>>
>>> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fp |wc -c
>>> 12351763
>>> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fc |wc -c
>>> 21890708
>>>
>>> That's not true for gzip:
>>>
>>> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z gzip -Fc |wc -c
>>> 2118869
>>> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z gzip -Fp |wc -c
>>> 2115832
>>>
>>> The function ought to at least use streaming mode, so each block/row
>>> isn't compressed in isolation.  003 is a simple patch to use
>>> streaming mode, which improves the -Fc case:
>>>
>>> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fc |wc -c
>>> 15178283
>>>
>>> However, that still flushes the compression buffer, writing a block
>>> header, for every row.  With a single-column table, pg_dump -Fc -Z lz4
>>> still outputs ~10% *more* data than with no compression at all.  And
>>> that's for compressible data.
>>>
>>> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Fc -Z lz4 |wc -c
>>> 12890296
>>> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Fc -Z none |wc -c
>>> 11890296
>>>
>>> I think this should use the LZ4F API with frames, which are buffered to
>>> avoid outputting a header for every single row.  The LZ4F format isn't
>>> compatible with the LZ4 format, so (unlike changing to the streaming
>>> API) that's not something we can change in a bugfix release.  I consider
>>> this an Open Item.
>>>
>>> With the LZ4F API in 004, -Fp and -Fc are essentially the same size
>>> (like gzip).  (Oh, and the output is three times smaller, too.)
>>>
>>> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z lz4 -Fp |wc -c
>>> 4155448
>>> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z lz4 -Fc |wc -c
>>> 4156548
>>
>> Thanks. Those are definitely interesting improvements/optimizations!
>>
>> I suggest we track them as a separate patch series - please add them to
>> the CF app (I guess you'll have to add them to 2023-07 at this point,
>> but we can get them in, I think).
> 
> Thanks for looking.  I'm not sure if I'm the best person to write/submit
> the patch to implement that for LZ4.  Georgios, would you want to take
> on this change?
> 
> I think that needs to be changed for v16, since 1) LZ4F works so much
> better like this, and 2) we can't change it later without breaking
> compatibility of the dumpfiles, unless we mark the header with a name
> other than "lz4".  Also, I imagine we'd want to continue supporting the
> ability to *restore* a dumpfile using the old (current) format, which
> would be untestable code unless we also preserved the ability to write
> it somehow (like -Z lz4-old).
> 

I'm a bit confused about the lz4 vs. lz4f stuff, TBH. If we switch to
lz4f, doesn't that mean it (e.g. restore) won't work on systems that
only have an older lz4 version? What would/should happen if we take a
backup compressed with lz4f, and then try restoring it on an older
system where lz4 does not support lz4f?

Maybe if the lz4f format is incompatible with regular lz4, we should
treat it as a separate compression method 'lz4f'?

I'm mostly afk until the end of the week, but I tried searching for lz4f
info - the results are not particularly enlightening, unfortunately.

AFAICS this only applies to lz4f stuff. Or would the streaming mode be a
breaking change too?
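
To make sure I understand the difference, here is a rough standalone
sketch of the two liblz4 code paths being discussed - plain per-chunk
block compression (roughly what the current -Fc path does) vs. feeding
all chunks through a single LZ4F frame. This is not the pg_dump code,
just the bare library calls, with error handling left out:

/*
 * Standalone sketch, not the pg_dump code - it only contrasts the two
 * liblz4 APIs discussed above.  compress_blockwise() mimics what the
 * current -Fc path effectively does (each chunk compressed on its own),
 * while compress_frame() pushes all chunks through one LZ4F frame, so
 * the frame header is emitted once and history is shared across chunks.
 * Error handling (LZ4F_isError() etc.) is omitted for brevity.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <lz4.h>
#include <lz4frame.h>

static size_t
compress_blockwise(const char **chunks, int n)
{
    size_t      total = 0;

    for (int i = 0; i < n; i++)
    {
        int         len = (int) strlen(chunks[i]);
        int         bound = LZ4_compressBound(len);
        char       *dst = malloc(bound);

        /* each call starts from scratch: no shared dictionary/history */
        total += LZ4_compress_default(chunks[i], dst, len, bound);
        free(dst);
    }
    return total;
}

static size_t
compress_frame(const char **chunks, int n)
{
    LZ4F_cctx  *cctx;
    size_t      total = 0;
    size_t      bufsize = LZ4F_compressBound(1 << 16, NULL);
    char       *buf = malloc(bufsize);

    LZ4F_createCompressionContext(&cctx, LZ4F_VERSION);

    /* the frame header is written once, not per chunk */
    total += LZ4F_compressBegin(cctx, buf, bufsize, NULL);

    /* LZ4F buffers internally, so small chunks don't each get a header */
    for (int i = 0; i < n; i++)
        total += LZ4F_compressUpdate(cctx, buf, bufsize,
                                     chunks[i], strlen(chunks[i]), NULL);

    /* flush whatever is still buffered and write the end mark */
    total += LZ4F_compressEnd(cctx, buf, bufsize, NULL);

    LZ4F_freeCompressionContext(cctx);
    free(buf);
    return total;
}

int
main(void)
{
    const char *chunks[] = {"row one, fairly compressible text\n",
                            "row two, fairly compressible text\n"};

    printf("blockwise: %zu bytes, frame: %zu bytes\n",
           compress_blockwise(chunks, 2),
           compress_frame(chunks, 2));
    return 0;
}

(The compressed bytes are only counted, not written anywhere - the point
is just where the frame header and the per-chunk overhead come from.)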

> One issue is that LZ4F_createCompressionContext() and
> LZ4F_compressBegin() ought to be called in InitCompressorLZ4().  It
> seems like it might *need* to be called there to avoid exactly the kind
> of issue that I reported with empty LOs with gzip.  But
> InitCompressorLZ4() isn't currently passed the ArchiveHandle, so can't
> write the header.  And LZ4CompressorState has a simple char *buf, and
> not a more elaborate data structure like zlib's.  You could work around
> that by also storing the "len" of the existing buffer, and flushing
> it in EndCompressorLZ4(), but that adds needless complexity to the Write
> and End functions.  Maybe the Init function should be passed the AH.
> 

Not sure, but looking at GzipCompressorState I see the only extra thing
it has (compared to LZ4CompressorState) is the "z_streamp". I can't
experiment with this until the end of this week, so perhaps that's not
workable, but wouldn't it be better to add a similar field to
LZ4CompressorState? Passing AH to the init function seems like a
violation of abstraction.
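
Something along these lines is what I had in mind - only a sketch, the
new member names are invented and the existing ones are paraphrased from
your description rather than copied from the code:

/*
 * Hypothetical shape only.  The idea is that the LZ4F context and a
 * "frame header not yet written" flag live in the compressor-private
 * state, the same way GzipCompressorState keeps its z_streamp, so
 * InitCompressorLZ4() does not need the ArchiveHandle at all and the
 * frame header can be emitted lazily on the first
 * WriteDataToArchiveLZ4() call.
 */
#include <stdbool.h>
#include <lz4frame.h>

typedef struct LZ4CompressorState
{
    char       *outbuf;         /* buffer for compressed output */
    size_t      outsize;        /* allocated size of outbuf */

    LZ4F_cctx  *ctx;            /* frame compression context (new) */
    bool        headerWritten;  /* has LZ4F_compressBegin() run yet? (new) */
} LZ4CompressorState;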


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


