Re: zstd compression for pg_dump - Mailing list pgsql-hackers

From Justin Pryzby
Subject Re: zstd compression for pg_dump
Date
Msg-id ZCdpzywVGl0BLan2@telsasoft.com
Whole thread Raw
In response to Re: zstd compression for pg_dump  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: zstd compression for pg_dump
List pgsql-hackers
On Tue, Mar 28, 2023 at 06:23:26PM +0200, Tomas Vondra wrote:
> On 3/27/23 19:28, Justin Pryzby wrote:
> > On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
> >> On 3/16/23 05:50, Justin Pryzby wrote:
> >>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
> >>>> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion <jchampion@timescale.com> wrote:
> >>>>> I did some smoke testing against zstd's GitHub release on Windows. To
> >>>>> build against it, I had to construct an import library, and put that
> >>>>> and the DLL into the `lib` folder expected by the MSVC scripts...
> >>>>> which makes me wonder if I've chosen a harder way than necessary?
> >>>>
> >>>> It looks like pg_dump's meson.build is missing dependencies on zstd
> >>>> (meson couldn't find the headers in the subproject without them).
> >>>
> >>> I saw that this was added for LZ4, but I hadn't added it for zstd since
> >>> I didn't run into an issue without it.  Could you check that what I've
> >>> added works for your case ?
> >>>
> >>>>> Parallel zstd dumps seem to work as expected, in that the resulting
> >>>>> pg_restore output is identical to uncompressed dumps and nothing
> >>>>> explodes. I haven't inspected the threading implementation for safety
> >>>>> yet, as you mentioned.
> >>>>
> >>>> Hm. Best I can tell, the CloneArchive() machinery is supposed to be
> >>>> handling safety for this, by isolating each thread's state. I don't feel
> >>>> comfortable pronouncing this new addition safe or not, because I'm not
> >>>> sure I understand what the comments in the format-specific _Clone()
> >>>> callbacks are saying yet.
> >>>
> >>> My line of reasoning for unix is that pg_dump forks before any calls to
> >>> zstd.  Nothing zstd does ought to affect the pg_dump layer.  But that
> >>> doesn't apply to pg_dump under windows.  This is an opened question.  If
> >>> there's no solid answer, I could disable/ignore the option (maybe only
> >>> under windows).
> >>
> >> I may be missing something, but why would the patch affect this? Why
> >> would it even affect safety of the parallel dump? And I don't see any
> >> changes to the clone stuff ...
> > 
> > zstd supports using threads during compression, with -Z zstd:workers=N.
> > When unix forks, the child processes can't do anything to mess up the
> > state of the parent processes.  
> > 
> > But windows pg_dump uses threads instead of forking, so it seems
> > possible that the pg_dump -j threads that then spawn zstd threads could
> > "leak threads" and break the main thread.  I suspect there's no issue,
> > but we still ought to verify that before declaring it safe.
> 
> OK. I don't have access to a Windows machine so I can't test that. Is it
> possible to disable the zstd threading, until we figure this out?

I think that's what's best.  I made it issue a warning if "workers" was
specified.  It could also be an error, or just ignored.

I considered disabling workers only for windows, but realized that I
haven't tested with threads myself - my local zstd package is compiled
without threading, and I remember having some issue recompiling it with
threading.  Jacob's recipe for using meson wraps works well, but it
still seems better to leave it as a future feature.  I used that recipe
to enabled zstd with threading on CI (except for linux/autoconf).

> >>> The function is first checking if it was passed a filename which already
> >>> has a suffix.  And if not, it searches through a list of suffixes,
> >>> testing for an existing file with each suffix.  The search with stat()
> >>> doesn't happen if it has a suffix.  I'm having trouble seeing how the
> >>> hasSuffix() branch isn't dead code.  Another opened question.
> >>
> >> AFAICS it's done this way because of this comment in pg_backup_directory
> >>
> >>  * ...
> >>  * ".gz" suffix is added to the filenames. The TOC files are never
> >>  * compressed by pg_dump, however they are accepted with the .gz suffix
> >>  * too, in case the user has manually compressed them with 'gzip'.
> >>
> >> I haven't tried, but I believe that if you manually compress the
> >> directory, it may hit this branch.
> > 
> > That would make sense, but when I tried, it didn't work like that.
> > The filenames were all uncompressed names.  Maybe it worked differently
> > in an older release.  Or maybe it changed during development of the
> > parallel-directory-dump patch and it's actually dead code.
> 
> Interesting. Would be good to find out. I wonder if a little bit of
> git-log digging could tell us more. Anyway, until we confirm it's dead
> code, we should probably do what .gz does and have the same check for
> .lz4 and .zst files.

I found that hasSuffix() and cfopen() originated in the refactored patch
Heikki's sent here; there's no history beyond that.

https://www.postgresql.org/message-id/4D3954C7.9060503%40enterprisedb.com

The patch published there appends the .gz within cfopen(), and the
caller writes into the TOC the filename without ".gz".  It seems like
maybe a few hours prior, Heikki may have been appending the ".gz" suffix
in the caller, and then writing the TOC with filename.gz.

The only way I've been able to get a "filename.gz" passed to hasSuffix
is to write a directory-format dump, with LOs, and without compression,
and then compress the blobs with "gzip", and *also* edit the blobs.toc
file to say ".gz" (which isn't necessary since, if the original file
isn't found, the restore would search for files with compressed
suffixes).

So .. it's not *technically* unreachable, but I can't see why it'd be
useful to support editing the *content* of the blob TOC (other than
compressing it).  I might give some weight to the idea if it were also
possible to edit the non-blob TOC; but, it's a binary file, so no.

For now, I made the change to make zstd and lz4 to behave the same here
as .gz, unless Heikki has a memory or a git reflog going back far enough
to further support the idea that the code path isn't useful.

I'm going to set the patch as RFC as a hint to anyone who would want to
make a final review.

-- 
Justin

Attachment

pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: regression coverage gaps for gist and hash indexes
Next
From: Tomas Vondra
Date:
Subject: Re: Add LZ4 compression in pg_dump