Re: zstd compression for pg_dump - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: zstd compression for pg_dump
Msg-id: 4029fe58-f661-7d8d-cbbb-b936a05531e2@enterprisedb.com
In response to: Re: zstd compression for pg_dump (Justin Pryzby <pryzby@telsasoft.com>)
Responses: Re: zstd compression for pg_dump
List: pgsql-hackers
On 4/1/23 01:16, Justin Pryzby wrote:
> On Tue, Mar 28, 2023 at 06:23:26PM +0200, Tomas Vondra wrote:
>> On 3/27/23 19:28, Justin Pryzby wrote:
>>> On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
>>>> On 3/16/23 05:50, Justin Pryzby wrote:
>>>>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
>>>>>> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion <jchampion@timescale.com> wrote:
>>>>>>> I did some smoke testing against zstd's GitHub release on Windows. To
>>>>>>> build against it, I had to construct an import library, and put that
>>>>>>> and the DLL into the `lib` folder expected by the MSVC scripts...
>>>>>>> which makes me wonder if I've chosen a harder way than necessary?
>>>>>>
>>>>>> It looks like pg_dump's meson.build is missing dependencies on zstd
>>>>>> (meson couldn't find the headers in the subproject without them).
>>>>>
>>>>> I saw that this was added for LZ4, but I hadn't added it for zstd since
>>>>> I didn't run into an issue without it. Could you check that what I've
>>>>> added works for your case?
>>>>>
>>>>>>> Parallel zstd dumps seem to work as expected, in that the resulting
>>>>>>> pg_restore output is identical to uncompressed dumps and nothing
>>>>>>> explodes. I haven't inspected the threading implementation for safety
>>>>>>> yet, as you mentioned.
>>>>>>
>>>>>> Hm. Best I can tell, the CloneArchive() machinery is supposed to be
>>>>>> handling safety for this, by isolating each thread's state. I don't feel
>>>>>> comfortable pronouncing this new addition safe or not, because I'm not
>>>>>> sure I understand what the comments in the format-specific _Clone()
>>>>>> callbacks are saying yet.
>>>>>
>>>>> My line of reasoning for unix is that pg_dump forks before any calls to
>>>>> zstd. Nothing zstd does ought to affect the pg_dump layer. But that
>>>>> doesn't apply to pg_dump under windows. This is an open question. If
>>>>> there's no solid answer, I could disable/ignore the option (maybe only
>>>>> under windows).
>>>>
>>>> I may be missing something, but why would the patch affect this? Why
>>>> would it even affect safety of the parallel dump? And I don't see any
>>>> changes to the clone stuff ...
>>>
>>> zstd supports using threads during compression, with -Z zstd:workers=N.
>>> When unix forks, the child processes can't do anything to mess up the
>>> state of the parent processes.
>>>
>>> But windows pg_dump uses threads instead of forking, so it seems
>>> possible that the pg_dump -j threads that then spawn zstd threads could
>>> "leak threads" and break the main thread. I suspect there's no issue,
>>> but we still ought to verify that before declaring it safe.
>>
>> OK. I don't have access to a Windows machine so I can't test that. Is it
>> possible to disable the zstd threading, until we figure this out?
>
> I think that's what's best. I made it issue a warning if "workers" was
> specified. It could also be an error, or just ignored.
>
> I considered disabling workers only for windows, but realized that I
> haven't tested with threads myself - my local zstd package is compiled
> without threading, and I remember having some issue recompiling it with
> threading. Jacob's recipe for using meson wraps works well, but it
> still seems better to leave it as a future feature. I used that recipe
> to enable zstd with threading on CI (except for linux/autoconf).
>

+1 to disable this if we're unsure it works correctly.
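Just to illustrate what I have in mind (a rough sketch, not the patch's
code - the helper name and the warning text are made up): as far as I
know, a libzstd built without multithreading rejects any nonzero value
for ZSTD_c_nbWorkers, so the unsupported case can be detected and
reported along these lines:

#include <stdbool.h>
#include <stdio.h>
#include <zstd.h>

/*
 * Sketch only: request worker threads and detect a libzstd built
 * without multithreading, which rejects nonzero ZSTD_c_nbWorkers.
 */
static bool
set_zstd_workers(ZSTD_CCtx *cctx, int nworkers)
{
    size_t      ret;

    ret = ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, nworkers);
    if (ZSTD_isError(ret))
    {
        fprintf(stderr, "warning: compression workers not supported: %s\n",
                ZSTD_getErrorName(ret));
        return false;
    }
    return true;
}

int
main(void)
{
    ZSTD_CCtx  *cctx = ZSTD_createCCtx();

    if (!set_zstd_workers(cctx, 4))
    {
        /* warn and continue single-threaded, or error out as discussed */
    }

    ZSTD_freeCCtx(cctx);
    return 0;
}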
I agree it's better to just error out if workers are requested - I rather
dislike when a tool silently ignores an explicit parameter. And AFAICS
that's what zstd itself does when workers are requested on an incompatible
build.

FWIW I've been thinking about this a bit more and I don't quite see why the
threading would cause issues (except for Windows). I forgot pg_basebackup
already supports zstd, including the worker threading, so why would it work
there and not in pg_dump? Sure, pg_basebackup is not parallel, but with
separate pg_dump processes that shouldn't be an issue (although I'm not
sure when zstd creates the threads).

The one thing I'm wondering about is at which point the worker threads are
spawned - but presumably not before the pg_dump processes fork.

I'll try building zstd with threading enabled, and do some tests over the
weekend.

>>>>> The function is first checking if it was passed a filename which already
>>>>> has a suffix. And if not, it searches through a list of suffixes,
>>>>> testing for an existing file with each suffix. The search with stat()
>>>>> doesn't happen if it has a suffix. I'm having trouble seeing how the
>>>>> hasSuffix() branch isn't dead code. Another open question.
>>>>
>>>> AFAICS it's done this way because of this comment in pg_backup_directory
>>>>
>>>> * ...
>>>> * ".gz" suffix is added to the filenames. The TOC files are never
>>>> * compressed by pg_dump, however they are accepted with the .gz suffix
>>>> * too, in case the user has manually compressed them with 'gzip'.
>>>>
>>>> I haven't tried, but I believe that if you manually compress the
>>>> directory, it may hit this branch.
>>>
>>> That would make sense, but when I tried, it didn't work like that.
>>> The filenames were all uncompressed names. Maybe it worked differently
>>> in an older release. Or maybe it changed during development of the
>>> parallel-directory-dump patch and it's actually dead code.
>>
>> Interesting. Would be good to find out. I wonder if a little bit of
>> git-log digging could tell us more. Anyway, until we confirm it's dead
>> code, we should probably do what .gz does and have the same check for
>> .lz4 and .zst files.
>
> I found that hasSuffix() and cfopen() originated in the refactored patch
> Heikki sent here; there's no history beyond that.
>
> https://www.postgresql.org/message-id/4D3954C7.9060503%40enterprisedb.com
>
> The patch published there appends the .gz within cfopen(), and the
> caller writes into the TOC the filename without ".gz". It seems like
> maybe a few hours prior, Heikki may have been appending the ".gz" suffix
> in the caller, and then writing the TOC with filename.gz.
>
> The only way I've been able to get a "filename.gz" passed to hasSuffix()
> is to write a directory-format dump, with LOs, and without compression,
> and then compress the blobs with "gzip", and *also* edit the blobs.toc
> file to say ".gz" (which isn't necessary since, if the original file
> isn't found, the restore would search for files with compressed
> suffixes).
>
> So .. it's not *technically* unreachable, but I can't see why it'd be
> useful to support editing the *content* of the blob TOC (other than
> compressing it). I might give some weight to the idea if it were also
> possible to edit the non-blob TOC; but, it's a binary file, so no.
>
> For now, I made the change to make zstd and lz4 behave the same here
> as .gz, unless Heikki has a memory or a git reflog going back far enough
> to further support the idea that the code path isn't useful.
>

Makes sense.
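FWIW, the probing we're talking about amounts to something like the
sketch below - try the name recorded in the TOC first, then stat() it
with each known compression suffix appended. This is hypothetical
illustration only, not the actual pg_backup_directory code; the names
are made up:

#include <stdio.h>
#include <sys/stat.h>

/* hypothetical suffix list, mirroring the .gz / .lz4 / .zst handling */
static const char *const suffixes[] = {"", ".gz", ".lz4", ".zst"};

/*
 * Sketch only: given the filename recorded in the TOC, find an existing
 * file by trying each known compression suffix in turn.  Returns 1 and
 * fills "found" on success, 0 if nothing exists.
 */
static int
find_file_with_suffix(const char *base, char *found, size_t foundlen)
{
    struct stat st;
    size_t      i;

    for (i = 0; i < sizeof(suffixes) / sizeof(suffixes[0]); i++)
    {
        snprintf(found, foundlen, "%s%s", base, suffixes[i]);
        if (stat(found, &st) == 0)
            return 1;
    }
    return 0;
}

int
main(int argc, char **argv)
{
    char        path[4096];

    if (argc > 1 && find_file_with_suffix(argv[1], path, sizeof(path)))
        printf("found: %s\n", path);
    return 0;
}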
Let's keep the same behavior for all compression methods, and if it turns
out to be unreachable we can remove it from all of them. It's a trivial
amount of code, we can live with that.

> I'm going to set the patch as RFC as a hint to anyone who would want to
> make a final review.
>

OK.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company