Re: Add LZ4 compression in pg_dump - Mailing list pgsql-hackers

From Justin Pryzby
Subject Re: Add LZ4 compression in pg_dump
Msg-id ZAoGO+/Besr40GLp@telsasoft.com
In response to Re: Add LZ4 compression in pg_dump  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: Add LZ4 compression in pg_dump  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
On Wed, Mar 01, 2023 at 05:39:54PM +0100, Tomas Vondra wrote:
> On 2/27/23 05:49, Justin Pryzby wrote:
> > On Sat, Feb 25, 2023 at 08:05:53AM -0600, Justin Pryzby wrote:
> >> On Fri, Feb 24, 2023 at 11:02:14PM -0600, Justin Pryzby wrote:
> >>> I have some fixes (attached) and questions while polishing the patch for
> >>> zstd compression.  The fixes are small and could be integrated with the
> >>> patch for zstd, but could be applied independently.
> >>
> >> One more - WriteDataToArchiveGzip() says:
> > 
> > One more again.
> > 
> > The LZ4 path is using non-streaming mode, which compresses each block
> > without persistent state, giving poor compression for -Fc compared with
> > -Fp.  If the data is highly compressible, the difference can be orders
> > of magnitude.
> > 
> > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fp |wc -c
> > 12351763
> > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fc |wc -c
> > 21890708
> > 
> > That's not true for gzip:
> > 
> > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z gzip -Fc |wc -c
> > 2118869
> > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z gzip -Fp |wc -c
> > 2115832
> > 
> > The function ought to at least use streaming mode, so each block/row
> > isn't compressed in isolation.  003 is a simple patch to use
> > streaming mode, which improves the -Fc case:
> > 
> > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fc |wc -c
> > 15178283
> > 
> > However, that still flushes the compression buffer, writing a block
> > header, for every row.  With a single-column table, pg_dump -Fc -Z lz4
> > still outputs ~10% *more* data than with no compression at all.  And
> > that's for compressible data.
> > 
> > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Fc -Z lz4 |wc -c
> > 12890296
> > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Fc -Z none |wc -c
> > 11890296
> > 
> > I think this should use the LZ4F API with frames, which are buffered to
> > avoid outputting a header for every single row.  The LZ4F format isn't
> > compatible with the LZ4 format, so (unlike changing to the streaming
> > API) that's not something we can change in a bugfix release.  I consider
> > this an Open Item.
> > 
> > With the LZ4F API in 004, -Fp and -Fc are essentially the same size
> > (like gzip).  (Oh, and the output is three times smaller, too.)
> > 
> > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z lz4 -Fp |wc -c
> > 4155448
> > $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z lz4 -Fc |wc -c
> > 4156548
> 
> Thanks. Those are definitely interesting improvements/optimizations!
> 
> I suggest we track them as a separate patch series - please add them to
> the CF app (I guess you'll have to add them to 2023-07 at this point,
> but we can get them in, I think).

Thanks for looking.  I'm not sure if I'm the best person to write/submit
the patch to implement that for LZ4.  Georgios, would you want to take
on this change?

I think that needs to be changed for v16, since 1) LZ4F works so much
better this way, and 2) we can't change it later without breaking
compatibility of the dumpfiles, short of giving the new format another
name than "lz4".  Also, I imagine we'd want to continue supporting the
ability to *restore* a dumpfile using the old (current) format, which
would be untestable code unless we also preserved the ability to write
it somehow (like -Z lz4-old).

One issue is that LZ4F_createCompressionContext() and
LZ4F_compressBegin() ought to be called in InitCompressorLZ4().  It
seems like they might *need* to be called there to avoid exactly the
kind of issue that I reported with empty LOs with gzip.  But
InitCompressorLZ4() isn't currently passed the ArchiveHandle, so it
can't write the header.  And LZ4CompressorState has a simple char *buf,
not a more elaborate data structure like zlib's.  You could work around
that by storing the "len" of the existing buffer and flushing it in
EndCompressorLZ4(), but that adds needless complexity to the Write and
End functions.  Maybe the Init function should be passed the AH.
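
To illustrate, here's a rough sketch (not against the actual pg_dump
code; LZ4State and writeF() are hypothetical stand-ins for
LZ4CompressorState and whatever the ArchiveHandle would provide) of an
Init function that sets up the LZ4F context and writes the frame header
up front:

#include <stdlib.h>
#include <lz4frame.h>

/* Hypothetical per-compressor state, not the actual LZ4CompressorState */
typedef struct LZ4State
{
    LZ4F_cctx  *ctx;
    char       *buf;
    size_t      buflen;
} LZ4State;

/*
 * Sketch: create the LZ4F context and emit the frame header immediately,
 * assuming Init is handed a write callback (standing in for the AH).
 */
static void
InitCompressorLZ4_sketch(LZ4State *state,
                         void (*writeF) (const void *ptr, size_t len))
{
    size_t      status;

    if (LZ4F_isError(LZ4F_createCompressionContext(&state->ctx, LZ4F_VERSION)))
        exit(1);                /* error handling elided */

    state->buflen = LZ4F_HEADER_SIZE_MAX;
    state->buf = malloc(state->buflen);

    /* Writes only the frame header; rows then go through LZ4F_compressUpdate() */
    status = LZ4F_compressBegin(state->ctx, state->buf, state->buflen, NULL);
    if (LZ4F_isError(status))
        exit(1);

    writeF(state->buf, status);
}

With the header emitted from Init, the Write function would only need to
call LZ4F_compressUpdate(), and the End function would just call
LZ4F_compressEnd() to flush the last block.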

-- 
Justin
