Re: Fix assertion in autovacuum worker - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: Fix assertion in autovacuum worker |
Date | |
Msg-id | 20231129024859.kajlgqg33dksonbi@awork3.anarazel.de Whole thread Raw |
In response to | Re: Fix assertion in autovacuum worker (Nathan Bossart <nathandbossart@gmail.com>) |
Responses |
Re: Fix assertion in autovacuum worker
|
List | pgsql-hackers |
Hi, On 2023-11-28 20:42:47 -0600, Nathan Bossart wrote: > On Tue, Nov 28, 2023 at 04:03:49PM -0800, Andres Freund wrote: > > On 2023-11-28 16:05:16 -0600, Nathan Bossart wrote: > >> From a glance, it looks to me like the problem is that pgstat_shutdown_hook > >> is registered as a before_shmem_exit callback, while ProcKill is registered > >> as an on_shmem_exit callback. > > > > That's required, as pgstat_shutdown_hook() needs to acquire lwlocks, which you > > can't after ProcKill(). It's also not unique to pgstat, several other > > before_shmem_exit() callbacks acquire lwlocks (e.g. AtProcExit_Twophase(), > > ParallelWorkerShutdown(), do_pg_abort_backup(), the first three uses of > > before_shmem_exit when git grepping) - which makes sense, they are presumably > > before_shmem_exit() because they need to manage shared state, which often > > needs locks. > > > > In normal backends this is fine-ish, because ShutdownPostgres() is registered > > very late (and thus is called early in the shutdown sequence), and the > > AbortOutOfAnyTransaction() contained therein indirectly calls > > LWLockReleaseAll() and very little happens outside of the transaction context. > > Right. Perhaps we could add a LWLockReleaseAll() to > pgstat_shutdown_hook() instead of the autovacuum code, but I'm afraid that > is still just a hack. Yea, we'd need that in just about all before_shmem_exit() callbacks. I could see an argument for doing it in proc_exit_prepare(). While that'd be a fairly gross layering violation, we already do reset a number a bunch of stuff in there: /* * Forget any pending cancel or die requests; we're doing our best to * close up shop already. Note that the signal handlers will not set * these flags again, now that proc_exit_inprogress is set. */ InterruptPending = false; ProcDiePending = false; QueryCancelPending = false; InterruptHoldoffCount = 1; CritSectionCount = 0; > >> I would expect your patch to fix this particular issue, but I'm wondering > >> whether there's a bigger problem here. > > > > Yes, there is - our subsystem initialization, shutdown, error recovery > > infrastructure is a mess. We've interwoven transaction handling far too > > tightly with error handling, the order of subystem initialization is basically > > random and differs between operating systems (due to EXEC_BACKEND) and "mode" > > of execution (the order differs when using single user mode) and we've > > distributed error recovery into ~10 places (all the sigsetjmp()s in backend > > code, xact.c and and a few other places like WalSndErrorCleanup()). > > :( > > I do remember looking into uniting all the various sigsetjmp() calls > before. That could be worth another try. The rest will probably require > additional thought... It'd definitely be worth some effort. I'm quite sure that we have a number of hard to find bugs around this. Greetings, Andres Freund
pgsql-hackers by date: