On 2/27/23 05:49, Justin Pryzby wrote:
> On Sat, Feb 25, 2023 at 08:05:53AM -0600, Justin Pryzby wrote:
>> On Fri, Feb 24, 2023 at 11:02:14PM -0600, Justin Pryzby wrote:
>>> I have some fixes (attached) and questions while polishing the patch for
>>> zstd compression. The fixes are small and could be integrated with the
>>> patch for zstd, but could be applied independently.
>>
>> One more - WriteDataToArchiveGzip() says:
>
> One more again.
>
> The LZ4 path is using non-streaming mode, which compresses each block
> without persistent state, giving poor compression for -Fc compared with
> -Fp. If the data is highly compressible, the difference can be orders
> of magnitude.
>
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fp |wc -c
> 12351763
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fc |wc -c
> 21890708
>
> That's not true for gzip:
>
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z gzip -Fc |wc -c
> 2118869
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z gzip -Fp |wc -c
> 2115832
>
> The function ought to at least use streaming mode, so each block/row
> isn't compressed in isolation. 003 is a simple patch to use
> streaming mode, which improves the -Fc case:
>
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fc |wc -c
> 15178283
>
> However, that still flushes the compression buffer, writing a block
> header, for every row. With a single-column table, pg_dump -Fc -Z lz4
> still outputs ~10% *more* data than with no compression at all. And
> that's for compressible data.
>
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Fc -Z lz4 |wc -c
> 12890296
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Fc -Z none |wc -c
> 11890296
>
> I think this should use the LZ4F API with frames, which are buffered to
> avoid outputting a header for every single row. The LZ4F format isn't
> compatible with the LZ4 format, so (unlike changing to the streaming
> API) that's not something we can change in a bugfix release. I consider
> this an Open Item.
>
> With the LZ4F API in 004, -Fp and -Fc are essentially the same size
> (like gzip). (Oh, and the output is three times smaller, too.)
>
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z lz4 -Fp |wc -c
> 4155448
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z lz4 -Fc |wc -c
> 4156548
>
Thanks. Those are definitely interesting improvements/optimizations!
I suggest we track them as a separate patch series - please add them to
the CF app (I guess you'll have to add them to 2023-07 at this point,
but we can get them in, I think).
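
For anyone who wants to see the effect outside pg_dump, here is a minimal
sketch of the three behaviours discussed above: compressing each row in
isolation, one shared stream that is flushed after every row, and one
buffered stream. It is only an illustration, not the patch: it uses
Python's zlib as a stand-in for LZ4 (the absolute numbers will differ, the
ordering should not), and the row data and function names are made up.

```python
import zlib

# Hypothetical stand-in for the rows of a single-column table.
rows = [
    f"row {i}: some fairly repetitive text value\n".encode()
    for i in range(2000)
]


def per_row(rows):
    """Compress every row in isolation: no state is shared between rows."""
    return sum(len(zlib.compress(r)) for r in rows)


def streaming_flush_each_row(rows):
    """Share one compressor across rows, but force output after each row."""
    comp = zlib.compressobj()
    total = 0
    for r in rows:
        total += len(comp.compress(r))
        # Z_SYNC_FLUSH emits everything buffered so far, at a per-row cost.
        total += len(comp.flush(zlib.Z_SYNC_FLUSH))
    return total + len(comp.flush())


def streaming_buffered(rows):
    """Share one compressor and let the library decide when to emit data."""
    comp = zlib.compressobj()
    total = sum(len(comp.compress(r)) for r in rows)
    return total + len(comp.flush())


if __name__ == "__main__":
    print("uncompressed          ", sum(len(r) for r in rows))
    print("per row               ", per_row(rows))
    print("stream, flush each row", streaming_flush_each_row(rows))
    print("stream, buffered      ", streaming_buffered(rows))
```

With short rows, the per-row overhead dominates the first two variants,
which is the same pattern as the -Fc numbers quoted above.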

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company