
From Justin Pryzby
Subject Re: zstd compression for pg_dump
Date 2023-03-03 18:55:46
Msg-id 20230303185546.GE12850@telsasoft.com
In response to Re: zstd compression for pg_dump  (Jacob Champion <jchampion@timescale.com>)
Responses Re: zstd compression for pg_dump  (Jacob Champion <jchampion@timescale.com>)
List pgsql-hackers
On Fri, Mar 03, 2023 at 10:32:53AM -0800, Jacob Champion wrote:
> On Sat, Feb 25, 2023 at 5:22 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > This resolves cfbot warnings: windows and cppcheck.
> > And refactors zstd routines.
> > And updates docs.
> > And includes some fixes for earlier patches that these patches conflict
> > with/depend on.
> 
> This'll need a rebase (cfbot took a while to catch up).

Soon.

> The patchset includes basebackup modifications, which are part of a
> different CF entry; was that intended?

Yes, it's intentional: if zstd:long mode were merged first, then this
patch should include long mode from the start; or, if pg_dump+zstd were
merged first, then long mode could be added to both places afterwards.

> I tried this on a local, 3.5GB, mostly-text table (from the UK Price

Thanks for looking.  If your zstd library is compiled with thread
support, could you also try with :workers=N?  I believe this is working
correctly, but I'm going to ask for help verifying that...
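
For anyone following along: :workers=N maps onto zstd's advanced API
roughly as below (a sketch, not the patch's actual glue code; setting
nbWorkers fails if libzstd was built without ZSTD_MULTITHREAD):

#include <stdio.h>
#include <zstd.h>

static int
set_zstd_workers(ZSTD_CCtx *cctx, int nworkers)
{
	size_t		ret;

	/* nbWorkers > 0 switches zstd to multithreaded streaming compression */
	ret = ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, nworkers);
	if (ZSTD_isError(ret))
	{
		/* typical failure: library compiled without thread support */
		fprintf(stderr, "could not set nbWorkers=%d: %s\n",
				nworkers, ZSTD_getErrorName(ret));
		return -1;
	}
	return 0;
}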

It'd be especially useful to test under Windows, where pg_dump/restore
use threads instead of forking...  If you have a Windows environment but
aren't set up for development, I think it's possible to get CirrusCI to
compile a patch for you and then retrieve the binaries provided as an
"artifact" (credit/blame for this idea should be directed to Thomas
Munro).

> With this particular dataset, I don't see much improvement with
> zstd:long.

Yeah.  I think this could be because either 1) you already got very good
compression without looking at more data; and/or 2) the neighboring data
is already at least as similar as the more distant data, in which case
there's nothing to gain from looking further back.
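
For reference, "long" mode is zstd's long-distance matching; on the
compression side it amounts to the advanced-API calls sketched below
(error checks omitted, and windowLog 27, i.e. 128MB, is just an example
value here, not what the patch uses):

#include <zstd.h>

/* sketch: what zstd:long would enable on the compression side */
static ZSTD_CCtx *
make_long_cctx(void)
{
	ZSTD_CCtx  *cctx = ZSTD_createCCtx();

	/* long-distance matching: search a wider window for repeats */
	ZSTD_CCtx_setParameter(cctx, ZSTD_c_enableLongDistanceMatching, 1);
	/* optionally widen the match window; 27 => 128MB */
	ZSTD_CCtx_setParameter(cctx, ZSTD_c_windowLog, 27);
	return cctx;
}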

> (At nearly double the CPU time, I get a <1% improvement in
> compression size.) I assume it's heavily data dependent, but from the
> notes on --long [2] it seems like they expect you to play around with
> the window size to further tailor it to your data. Does it make sense
> to provide the long option without the windowLog parameter?

I don't want to start exposing lots of fine-grained parameters at this
point.  In the immediate case, it looks like it may require more than
just adding another parameter:

    Note: If windowLog is set to larger than 27, --long=windowLog or
    --memory=windowSize needs to be passed to the decompressor.
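
To illustrate why (a sketch of the library-level equivalent; error
checks omitted): the decompression side has to opt in to windows larger
than the default limit, so exposing windowLog in pg_dump would also mean
teaching pg_restore to raise the cap:

#include <zstd.h>

/* sketch: decompression must opt in to large windows */
static ZSTD_DCtx *
make_wide_dctx(void)
{
	ZSTD_DCtx  *dctx = ZSTD_createDCtx();

	/*
	 * The default cap is windowLog 27 (128MB); frames written with a
	 * larger windowLog are rejected unless the cap is raised here.
	 */
	ZSTD_DCtx_setParameter(dctx, ZSTD_d_windowLogMax, 30);
	return dctx;
}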

-- 
Justin


