Re: Add LZ4 compression in pg_dump - Mailing list pgsql-hackers

From Justin Pryzby
Subject Re: Add LZ4 compression in pg_dump
Msg-id ZAoGO+/Besr40GLp@telsasoft.com
In response to Re: Add LZ4 compression in pg_dump  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: Add LZ4 compression in pg_dump  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
On Wed, Mar 01, 2023 at 05:39:54PM +0100, Tomas Vondra wrote:
> On 2/27/23 05:49, Justin Pryzby wrote:
> > On Sat, Feb 25, 2023 at 08:05:53AM -0600, Justin Pryzby wrote:
> >> On Fri, Feb 24, 2023 at 11:02:14PM -0600, Justin Pryzby wrote:
> >>> I have some fixes (attached) and questions while polishing the patch for
> >>> zstd compression.  The fixes are small and could be integrated with the
> >>> patch for zstd, but could be applied independently.
> >>
> >> One more - WriteDataToArchiveGzip() says:
> > 
> > One more again.
> > 
> > The LZ4 path is using non-streaming mode, which compresses each block
> > without persistent state, giving poor compression for -Fc compared with
> > -Fp.  If the data is highly compressible, the difference can be orders
> > of magnitude.
> > 
> > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fp |wc -c
> > 12351763
> > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fc |wc -c
> > 21890708
> > 
> > That's not true for gzip:
> > 
> > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z gzip -Fc |wc -c
> > 2118869
> > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z gzip -Fp |wc -c
> > 2115832
> > 
> > The function ought to at least use streaming mode, so each block/row
> > isn't compressed in isolation.  003 is a simple patch to use
> > streaming mode, which improves the -Fc case:
> > 
> > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fc |wc -c
> > 15178283
> > 
> > However, that still flushes the compression buffer, writing a block
> > header, for every row.  With a single-column table, pg_dump -Fc -Z lz4
> > still outputs ~10% *more* data than with no compression at all.  And
> > that's for compressible data.
> > 
> > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Fc -Z lz4 |wc -c
> > 12890296
> > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Fc -Z none |wc -c
> > 11890296
> > 
> > I think this should use the LZ4F API with frames, which are buffered to
> > avoid outputting a header for every single row.  The LZ4F format isn't
> > compatible with the LZ4 format, so (unlike changing to the streaming
> > API) that's not something we can change in a bugfix release.  I consider
> > this an Open Item.
> > 
> > With the LZ4F API in 004, -Fp and -Fc are essentially the same size
> > (like gzip).  (Oh, and the output is three times smaller, too.)
> > 
> > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z lz4 -Fp |wc -c
> > 4155448
> > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z lz4 -Fc |wc -c
> > 4156548
> 
> Thanks. Those are definitely interesting improvements/optimizations!
> 
> I suggest we track them as a separate patch series - please add them to
> the CF app (I guess you'll have to add them to 2023-07 at this point,
> but we can get them in, I think).

Thanks for looking.  I'm not sure if I'm the best person to write/submit
the patch to implement that for LZ4.  Georgios, would you want to take
on this change?

I think that needs to be changed for v16, since 1) LZ4F works so much
better this way, and 2) we can't change it later without breaking
compatibility of the dumpfiles, short of giving the new format another
name than "lz4".  Also, I imagine we'd want to continue supporting the
ability to *restore* a dumpfile using the old (current) format, which
would be untestable code unless we also preserved the ability to write
it somehow (like -Z lz4-old).

One issue is that LZ4F_createCompressionContext() and
LZ4F_compressBegin() ought to be called in InitCompressorLZ4().  It
seems like they might *need* to be called there to avoid exactly the
kind of issue that I reported with empty LOs with gzip.  But
InitCompressorLZ4() isn't currently passed the ArchiveHandle, so it
can't write the header.  And LZ4CompressorState has a simple char *buf,
not a more elaborate data structure like zlib's.  You could work around
that by storing the "len" of the existing buffer and flushing it in
EndCompressorLZ4(), but that adds needless complexity to the Write and
End functions.  Maybe the Init function should be passed the AH.
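
To illustrate, here's a rough sketch (not against the actual pg_dump
code; LZ4State and writeF() are hypothetical stand-ins for
LZ4CompressorState and whatever the ArchiveHandle would provide) of an
Init function that sets up the LZ4F context and writes the frame header
up front:

#include <stdlib.h>
#include <lz4frame.h>

/* Hypothetical per-compressor state, not the actual LZ4CompressorState */
typedef struct LZ4State
{
    LZ4F_cctx  *ctx;
    char       *buf;
    size_t      buflen;
} LZ4State;

/*
 * Sketch: create the LZ4F context and emit the frame header immediately,
 * assuming Init is handed a write callback (standing in for the AH).
 */
static void
InitCompressorLZ4_sketch(LZ4State *state,
                         void (*writeF) (const void *ptr, size_t len))
{
    size_t      status;

    if (LZ4F_isError(LZ4F_createCompressionContext(&state->ctx, LZ4F_VERSION)))
        exit(1);                /* error handling elided */

    state->buflen = LZ4F_HEADER_SIZE_MAX;
    state->buf = malloc(state->buflen);

    /* Writes only the frame header; rows then go through LZ4F_compressUpdate() */
    status = LZ4F_compressBegin(state->ctx, state->buf, state->buflen, NULL);
    if (LZ4F_isError(status))
        exit(1);

    writeF(state->buf, status);
}

With the header emitted from Init, the Write function would only need to
call LZ4F_compressUpdate(), and the End function would just call
LZ4F_compressEnd() to flush the last block.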

-- 
Justin
